RSS Advisory Board

How to Read an RSS Feed with Java Using XOM

Cover of the O'Reilly book XML in a Nutshell, 3rd Edition, by Elliotte Rusty Harold and W. Scott Means. The cover features a black-and-white illustration of the neck and head of a peafowl bird with a crest of tipped feathers on his head.

There are a lot of libraries for processing XML data with Java that can be used to read RSS feeds. One of the best is the open source library XOM created by the computer book author Elliotte Rusty Harold.

As he wrote one of his 20 books about Java and XML, Harold got so frustrated with the available Java libraries for XML that he created his own. XOM, which stands for XML Object Model, was designed to be easy to learn while still being strict about XML, requiring documents that are well-formed and utilize namespaces in complete adherence to the specification. (At the RSS Advisory Board, talk of following a spec is our love language.)

XOM was introduced in 2002 and is currently up to version 1.3.9, though all versions have remained compatible since 1.0. To use XOM, download the class library in one of the packages available on the XOM homepage. You can avoid needing any further configuration by choosing one of the options that includes third-party JAR files in the download. This allows XOM to use an included SAX parser under the hood to process XML.

Here's Java code that loads items from The Guardian's RSS 2.0 feed containing articles by Ben Hammersley, displaying them as HTML output:

// create an XML builder and load the feed using a URL
Builder bob = new Builder();
Document doc = bob.build("https://www.theguardian.com/profile/benhammersley/rss");
// load the root element and channel
Element rss = doc.getRootElement();
Element channel = rss.getFirstChildElement("channel");
// load all items in the channel
Elements items = channel.getChildElements("item");
for (Element item : items) {
    // load elements of the item
    String title = item.getFirstChildElement("title").getValue();
    String author = item.getFirstChildElement("creator",
        "http://purl.org/dc/elements/1.1/").getValue();
    String description = item.getFirstChildElement("description").getValue();
    // display the output
    System.out.println("<h2>" + title + "</h2>");
    System.out.println("<p><b>By " + author + "</b></p>");
    System.out.println("<p>" + description + "</p>");
}

All of the classes used in this code are in the top-level package nu.xom, which has comprehensive JavaDoc describing their use. Like all Java code this is a little long-winded, but Harold's class names do a good job of explaining what they do. A Builder uses its build() method with a URL as the argument to load a feed into a Document over the web. There are also other build methods to load a feed from a file, reader, input stream, or string.

Elements can be retrieved by name, such as "title", "link" or "description". When an element has only one child with a given name, that child can be retrieved using the getFirstChildElement() method with the name as the argument:

Element linkElement = item.getFirstChildElement("link");

When an element can contain multiple children with the same name, use getChildElements() instead:

Elements enclosures = item.getChildElements("enclosure");
if (enclosures.size() > 1) {
    System.out.println("I'm pretty sure an item should only include one enclosure");
}

If an element is in a namespace, there must be a second argument providing the namespace URI. Like many RSS feeds, the ones from The Guardian use a dc:creator element from Dublin Core to credit the item's author. That namespace has the URI "http://purl.org/dc/elements/1.1/".

If the element specified in getFirstChildElement() is not present, the method returns null. When getChildElements() finds no matching elements, it returns an empty Elements list instead. You may need to check for both cases when adapting the code to load other RSS feeds.
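Here's a minimal sketch of one way to guard against a missing element, so that calling getValue() on an absent child doesn't throw a NullPointerException. The class SafeItemReader and its methods childText() and describeFirstItem() are hypothetical names made up for this example, not part of XOM. It loads the feed from a string through a reader rather than a URL:

```java
import java.io.IOException;
import java.io.StringReader;
import nu.xom.Builder;
import nu.xom.Document;
import nu.xom.Element;
import nu.xom.ParsingException;

// Hypothetical helper class for illustration; not part of XOM.
class SafeItemReader {
    static final String DC_NAMESPACE = "http://purl.org/dc/elements/1.1/";

    // Returns the text of the first matching child element,
    // or the fallback when the parent has no such child.
    static String childText(Element parent, String name,
            String namespaceUri, String fallback) {
        Element child = (namespaceUri == null)
            ? parent.getFirstChildElement(name)
            : parent.getFirstChildElement(name, namespaceUri);
        return (child == null) ? fallback : child.getValue();
    }

    // Parses a feed held in a string and summarizes its first item.
    static String describeFirstItem(String feedXml) {
        try {
            Document doc = new Builder().build(new StringReader(feedXml));
            Element channel = doc.getRootElement().getFirstChildElement("channel");
            if (channel == null) {
                return "no channel";
            }
            Element item = channel.getFirstChildElement("item");
            if (item == null) {
                return "no items";
            }
            String title = childText(item, "title", null, "(untitled)");
            String author = childText(item, "creator", DC_NAMESPACE, "(unknown author)");
            return title + " by " + author;
        } catch (ParsingException | IOException e) {
            // Wrapped so callers are not forced to handle checked exceptions.
            throw new IllegalArgumentException("Feed could not be parsed", e);
        }
    }

    public static void main(String[] args) {
        // This item has a title but no dc:creator element.
        String feed = "<rss version=\"2.0\"><channel><item>"
            + "<title>Hello</title></item></channel></rss>";
        System.out.println(describeFirstItem(feed));
    }
}
```

The fallback strings are arbitrary choices. A feed reader might prefer to skip the item or leave the field blank.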

If the name Ben Hammersley sounds familiar, he coined the term "podcasting" in his February 2004 article for The Guardian about the new phenomenon of delivering audio files in RSS feeds.

Posted by Rogers Cadenhead at 2023/08/01 11:25 PM | 0 COMMENTS | permalink

The RSS Advisory Board Just Turned 20

A photo of the actor Leonardo DiCaprio as Jay Gatsby holding up a celebratory glass of champagne — *"Tomorrow we will run faster, stretch out our arms farther."*

Today is the 20th birthday of the RSS Advisory Board, the group that publishes the RSS specification. It was formed on July 18, 2003, when the copyright of the specification was transferred to Harvard University, which immediately released it under a Creative Commons license and deferred all matters related to RSS to the new board.

At the time of the board's launch, here's how the founding members described its purpose:

Is the advisory board a standards body?

No. It will not create new formats and protocols. It will encourage and help developers who wish to use RSS 2.0. Since the format is extensible, there are many ways to add to it, while remaining compatible with the RSS 2.0 specification. We will help people who wish to do so.

What does the advisory board actually do?

We answer questions, write tech notes, advocate for RSS, make minor changes to the spec per the roadmap, help people use the technology, maintain a directory of compatible applications, accept contributions from community members, and otherwise do what we can to help people and organizations be successful with RSS.

This remains the purpose 140 dog years later. In addition to maintaining the current RSS specification, we are the official publisher of Netscape's RSS 0.90 and RSS 0.91 specifications and Yahoo's Media RSS specification.

We also offer an RSS Validator and RSS Best Practices Profile containing our recommendations for how to implement the format.

There's a resurgence of interest in RSS today as people discover the exhilarating freedom of the open web. Some of this is due to dissatisfaction with deleterious changes at big social sites like Twitter and Reddit. Some is due to satisfaction with Mastodon, a decentralized social network with more than one million active users that is owned by nobody. As long as there are social media gatekeepers using engagement algorithms to decide what you can and can't see, there will be a need to get around them. When someone offers an RSS or Atom feed and you subscribe to it in a reader, you get their latest updates without manipulation.

Here's to another 20 years of feeding readers, unlocking gates, helping developers adopt RSS and repeatedly getting asked the question, "Can an RSS item contain more than one enclosure?"

Posted by Rogers Cadenhead at 2023/07/18 03:50 PM | 0 COMMENTS | permalink

Downloading 50,000 Podcast Feeds to Analyze Their RSS

The software developer Niko Abeler has crawled 51,165 podcast feeds to study what RSS elements they contain. His comprehensive Podcast Feed Standard report looks at the usage of core RSS elements and namespace elements from Apple iTunes, Atom, Content, Podcast 2.0 and Simple Chapters. He writes:

In the world of podcasting, there is a great deal of freedom when it comes to the format and content of a podcast. Creators are free to choose their own audio format and feed content, giving them the flexibility to create something truly unique. However, when it comes to distributing a podcast, certain standards must be followed in order to be added to an aggregator such as Apple Podcasts. Additionally, the podcasting community has come to agree upon certain conventions that can be used to add additional features to a podcast, such as chapters, enhanced audio, and more. These conventions allow for a more immersive and engaging listening experience for the audience.

This website is dedicated to providing guidance and information on the conventions and standards used in podcasting.

There's a lot of interesting data in the RSS 2.0 report, which finds that these are the six least popular elements in an RSS feed's channel:

Element	Usage
docs	8.3%
cloud	0.0%
rating	0.0%
skipDays	0.0%
skipHours	0.0%
textInput	0.0%

Over 99 percent of feeds contain the optional channel element language and the optional item elements enclosure, guid, pubDate and title. Only 0.2 percent of feeds contain a source element in an item.

The iTunes namespace report shows a lot of variation in support. The required element itunes:explicit is only present in 18 percent of feeds and four optional elements have less than 20 percent: itunes:new-feed-url, itunes:block, itunes:complete and itunes:title. One namespace in the report, Podcast 2.0, has been proposed by Podcastindex "to provide a solution for problems which previously have been solved by multiple competing standards" and is still under development.

The report also analyzes the audio files enclosed in the podcast feeds to determine their format, bitrate, channel and loudness. The report finds that 95.6 percent use MP3 and 4.4 percent AAC/M4A. People who like an alternative open source format will be oggravated that its sliver of the pie graph is so small it can't be seen.

If Abeler isn't tired of crunching numbers, one thing that would be useful for the RSS Advisory Board to learn is how many of the feeds contain more than one enclosure element within a single item.
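One way such a count might be made is sketched below. It uses the JDK's built-in DOM parser, which is an assumption for illustration only, because the report doesn't say what tooling Abeler used. The class EnclosureCounter and its method are hypothetical names:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration.
class EnclosureCounter {

    // Counts the items in one feed that have more than one
    // enclosure element as a direct child.
    static int itemsWithMultipleEnclosures(String feedXml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(feedXml)));
            NodeList items = doc.getElementsByTagName("item");
            int count = 0;
            for (int i = 0; i < items.getLength(); i++) {
                int enclosures = 0;
                // Walk only the direct children of this item.
                for (Node child = items.item(i).getFirstChild();
                        child != null; child = child.getNextSibling()) {
                    if (child.getNodeType() == Node.ELEMENT_NODE
                            && "enclosure".equals(child.getNodeName())) {
                        enclosures++;
                    }
                }
                if (enclosures > 1) {
                    count++;
                }
            }
            return count;
        } catch (Exception e) {
            throw new IllegalArgumentException("Feed could not be parsed", e);
        }
    }

    public static void main(String[] args) {
        String feed = "<rss version=\"2.0\"><channel>"
            + "<item><enclosure url=\"a.mp3\"/><enclosure url=\"b.mp3\"/></item>"
            + "<item><enclosure url=\"c.mp3\"/></item>"
            + "</channel></rss>";
        System.out.println(itemsWithMultipleEnclosures(feed));
    }
}
```

A feed would be tallied as containing a multi-enclosure item whenever this method returns a number above zero.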

Posted by Rogers Cadenhead at 2023/07/14 10:38 AM | 2 COMMENTS | permalink

Tara Calishain Explains: What is RSS?

The exodus of users away from Twitter and Reddit has led many of those information refugees to discover the joy of subscribing to feeds in a reader. RSS and Atom feeds are an enormous open decentralized network that can never be ruined under new ownership -- because there's no owner.

Tara Calishain of ResearchBuzz has written a 4,000-word introduction to RSS for people who are new to the world of feeds:

I could not do ResearchBuzz without RSS feeds. They're invaluable. And I think if you learn more about them, you'll appreciate why I consider RSS the most underrated tech on the Internet. That's what this article is about: I'm going to explain what RSS feeds are, show you how to find them, go over some of the RSS feed readers available, and, finally, list several tools and resources you might find useful on your journey.

... I follow over a thousand RSS feeds which deliver information to me throughout the day. Do you think I could visit a thousand websites a day to check for new information? Even if I tried to visit a thousand a week that would be over 142 websites a day. Assuming it took me two minutes to visit a site and check for new content, I would spend over 4.5 hours a day just visiting websites.

Do you see why I'm so grateful for RSS?

Calishain, who was blogging before Netscape created RSS in 1999, covers a lot more than the basics, showing how to find hidden feeds on websites, check a bunch of feeds for freshness and create keyword-based feeds to search sites like Google News, Hacker News and WordPress. Even experienced readers of readers will learn new things, and there's a collection of nine handy RSS Gizmos she has developed.

On that subject, Calishain just began programming a year ago:

In spring 2022 I decided to find out if I could really learn JavaScript after being diagnosed as autistic. (I'm a high school dropout and didn't think I could learn something like programming.)

I CAN! And I LOVE IT!

Welcome to the not-so-secret society of programmers, Tara! Please slow down a little. You're making the rest of us look bad.

Posted by Rogers Cadenhead at 2023/07/12 09:45 PM | 0 COMMENTS | permalink

Be Unique And Use RSS Guid Like Everybody Else

Black and white photo of 12 snowflakes from the Library of Congress taken by Theodor Horydczak to illustrate that all snowflakes are unique — *Winter scenes: Snowflakes by Theodor Horydczak*

If you publish an RSS feed, you should do a solid for the developers of RSS readers by including a guid in each item. The guid's job is to be a unique identifier that helps software downloading your feed decide whether it has seen that item before. Here's the guid for an item on the arts and technology blog Laughing Squid:

<guid isPermaLink="false">https://laughingsquid.com/?p=914660</guid>

No other item on Laughing Squid will ever have this guid value. It's a URL that loads a blog post with the title Playful Elephant Pretends to Eat Woman's Hat. If you load the guid's URL https://laughingsquid.com/?p=914660, it redirects to the permanent link of the post. Because the guid is not the permanent link, there's an isPermaLink attribute with a value of false.

Most guid values in RSS feeds are the permanent link of the item, as in this example from the world news site Semafor:

<guid>https://www.semafor.com/article/07/07/2023/us-jobs-data-what-experts-make-of-the-new-numbers</guid>

A drawback of using the permalink is that if any part of the URL changes -- such as the title text or the domain name -- the guid changes and RSS readers will think this is a new item to show the feed's subscribers, when it's actually a repeat.
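The effect can be shown with a minimal sketch of how a reader might track guids it has already seen. The class SeenItems and the example.com URLs are hypothetical, and real readers likely keep this state in a database rather than an in-memory set:

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of a reader's "seen items" memory.
class SeenItems {
    private final Set<String> seenGuids = new HashSet<>();

    // Records the guid and returns true only the first time it is seen.
    boolean markIfNew(String guid) {
        return seenGuids.add(guid);
    }

    public static void main(String[] args) {
        SeenItems seen = new SeenItems();
        String permalink = "https://example.com/2023/07/original-title";

        // First download of the feed: the item is new.
        System.out.println(seen.markIfNew(permalink));
        // Second download with the same guid: the item is a repeat.
        System.out.println(seen.markIfNew(permalink));
        // The publisher edits the title in the URL: the same post
        // now looks like a new item to the reader.
        System.out.println(seen.markIfNew("https://example.com/2023/07/edited-title"));
    }
}
```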

A guid doesn't have to be a URL. It can be any string that the feed publisher has chosen to be unique. Here's the guid from the RSS Advisory Board's feed for this blog post:

<guid isPermaLink="false">tag:rssboard.org,2006:weblog.217</guid>

Our guid follows the TAG URI scheme, a simple way to assure uniqueness by putting these five components together in this order:

The text "tag"
A domain owned by the feed provider
A year the provider owned that domain
A short name for the feed different from any other feed on the site
The internal ID number of the post

The components are separated by punctuation: a colon after "tag", a comma after the domain, a colon after the year and a period between the feed name and the ID. The year 2006 was when the board began using the domain rssboard.org. No one else used that domain that year, so any feed reader that stores "tag:rssboard.org,2006:weblog.217" as this item's guid should never encounter that value in any other item on any other feed.
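Putting the five components together can be sketched in a few lines. The class TagGuid and its build() method are hypothetical names, and the code covers only the layout used in the example above, not every form the TAG URI scheme allows:

```java
// Hypothetical helper for illustration.
class TagGuid {

    // Joins the components in order: tag, domain, year, feed name, post ID.
    static String build(String domain, int year, String feedName, long postId) {
        return "tag:" + domain + "," + year + ":" + feedName + "." + postId;
    }

    public static void main(String[] args) {
        // Prints tag:rssboard.org,2006:weblog.217
        System.out.println(build("rssboard.org", 2006, "weblog", 217));
    }
}
```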

To see how RSS 2.0 feeds are using guid, several thousand feeds were downloaded this evening from an RSS aggregator that publicly shares the OPML subscription lists of its users.

Category	Total	Percentage
Total number of feeds	4,954	--
Feeds using guid	4,777	96.4%
Feeds using non-permalinks in guid	752	15.2%

The term guid means "globally unique identifier," but RSS 2.0 does not require global uniqueness in guids. Because the TAG URI scheme does a good job of serving that purpose, Blogger, Flickr, MetaFilter, SoundCloud and The Register are among the sites using it in their feeds.

Posted by Rogers Cadenhead at 2023/07/10 10:30 PM | 0 COMMENTS | permalink
