Archive for the 'deep web' Category

RSS Feeds for the University of Florida’s Digital Collections

Laurie N. Taylor June 24th, 2008

In our ongoing work to improve the findability of books in the UF Digital Collections (UFDC), we now have an RSS page with feeds for each of the collections. The RSS feed page is http://www.uflib.ufl.edu/ufdc2/rss/.

Please sign up for a feed or two to learn about the great materials added daily, and please share the RSS feeds with others!

RSS Feeds, Coming Soon!

Laurie N. Taylor June 18th, 2008

In addition to our UFDC search engine optimization, we’re working on RSS feeds for all new items and for new items from each of the collections. Our RSS feed page will be here: http://www.uflib.ufl.edu/ufdc2/rss/ but it’s still in development right now. RSS feeds take advantage of the power of the web to syndicate and share content, and of the methods search engines use for ranking content. While this has arguably been problematic as traditional media takes its time in changing, using RSS feeds makes sense, especially for sites like the University of Florida Digital Collections, where we want to share content as widely and completely as possible. Hopefully this post will soon be followed by others on the active RSS feed!
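For readers curious what a "new items" feed involves, here is a minimal sketch in Python of building an RSS 2.0 document with the standard library. It is not how the UFDC feeds are actually generated; the function name `build_feed`, the item fields, and the sample item link are assumptions made up for illustration.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
from email.utils import format_datetime


def build_feed(title, link, description, items):
    """Return an RSS 2.0 document (as a string) for a list of new items.

    Each item is a dict with 'title', 'link', and 'added' (a timezone-aware
    datetime). These field names are hypothetical.
    """
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = title
    ET.SubElement(channel, "link").text = link
    ET.SubElement(channel, "description").text = description

    for item in items:
        node = ET.SubElement(channel, "item")
        ET.SubElement(node, "title").text = item["title"]
        ET.SubElement(node, "link").text = item["link"]
        # RSS 2.0 expects RFC 822 style dates; format_datetime produces them.
        ET.SubElement(node, "pubDate").text = format_datetime(item["added"])

    return ET.tostring(rss, encoding="unicode")


# Example with one made-up item; the item link is a placeholder.
sample_items = [
    {
        "title": "Example newly added item",
        "link": "http://www.uflib.ufl.edu/ufdc/example-item",
        "added": datetime(2008, 6, 18, 12, 0, tzinfo=timezone.utc),
    }
]
feed_xml = build_feed(
    "UFDC: all new items",
    "http://www.uflib.ufl.edu/ufdc2/rss/",
    "Items recently added to the UF Digital Collections",
    sample_items,
)
```

One feed per collection would then just be one call per collection, each with its own title and item list.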

Search Engine Optimization

Laurie N. Taylor June 17th, 2008

Now that the University of Florida Digital Collections is optimized for internal coding, we’re starting to optimize for search engines. We currently use robots.txt to request that search engines not crawl our site. Doing so was a hard choice because we want our materials to be accessible and used. However, we were forced to stop the search engines because they were crashing our server. We had a number of overzealous search engines that crawled and re-crawled, and crawled in strange ways. With our JPEG2000 images, the overcrawling and overly quick crawling ate too much memory, and we couldn’t serve the crawlers and remain functional. This overcrawling happened even with a site map and all of the proper webmaster configurations. Because the normal right way wasn’t working, we’ve chosen a secondary right way. We hope that this method works until we can make the normal right way work.
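As a rough illustration of the robots.txt approach, the Python sketch below parses a small robots.txt and checks which URLs a well-behaved crawler would be allowed to fetch. The rules shown are an assumption for illustration, not a copy of our actual file, and the crawler name `ExampleBot` and the item path are made up.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: ask all crawlers to stay out of the main site path.
# Rules are prefix matches, so the trailing slash matters: "/ufdc" alone
# would also match sibling paths such as "/ufdc2".
ROBOTS_TXT = """\
User-agent: *
Disallow: /ufdc/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


main_site_ok = is_allowed(
    ROBOTS_TXT, "ExampleBot", "http://www.uflib.ufl.edu/ufdc/some-item"
)
static_site_ok = is_allowed(
    ROBOTS_TXT, "ExampleBot", "http://www.uflib.ufl.edu/ufdc2/rss/"
)
```

Keep in mind that robots.txt is only a request: it works for crawlers that choose to honor it and does nothing to stop ones that do not.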

We’re currently building a separate single page for every item in the collection, and we’ll create these weekly until the normal search indexing works. These pages will live on www.uflib.ufl.edu/ufdc2 as opposed to our real site, www.uflib.ufl.edu/ufdc. These pages will have the basic information for each item, and the links will go over to the main site (UFDC). By allowing search engines to crawl and index the information on UFDC2, we hope that the search engines will include our information so that the site will be more findable without creating huge server memory drains.
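To make the idea concrete, here is a minimal Python sketch of generating one small static HTML page per item, each linking back to the main site. It is only an illustration under assumptions: the item fields (`id`, `title`, `summary`), the `?item=` URL pattern, and the function names are hypothetical and do not describe the real UFDC build.

```python
from html import escape
from pathlib import Path
from urllib.parse import quote

MAIN_SITE = "http://www.uflib.ufl.edu/ufdc"


def item_url(item_id):
    """Link to the item on the main site (hypothetical URL pattern)."""
    return f"{MAIN_SITE}/?item={quote(item_id, safe='')}"


def render_item_page(item):
    """Return a lightweight HTML page with basic metadata and a link."""
    title = escape(item["title"])
    summary = escape(item["summary"])
    url = escape(item_url(item["id"]), quote=True)
    return (
        "<!DOCTYPE html>\n"
        f"<html><head><title>{title}</title></head><body>"
        f"<h1>{title}</h1>"
        f"<p>{summary}</p>"
        f'<p><a href="{url}">View this item in UFDC</a></p>'
        "</body></html>\n"
    )


def build_pages(items, out_dir):
    """Write one static page per item into out_dir; return the paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for item in items:
        path = out / f"{quote(item['id'], safe='')}.html"
        path.write_text(render_item_page(item), encoding="utf-8")
        written.append(path)
    return written


# Example with a made-up item record.
sample = {"id": "UF00001", "title": "Maps & Atlases", "summary": "Sample record."}
page = render_item_page(sample)
```

Because these pages are plain text with no images, a crawler fetching them costs the server very little compared with a crawl of the image-serving pages.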

We’re not sure what the search engine problems were exactly, just that the engines (from multiple companies) were overcrawling. The University of Florida has an internal Google search appliance. Theoretically - and I haven’t read anything on this, but I would appreciate more information if anyone can help - Google’s main bots and UF’s instance could have crawled simultaneously, driving up the apparent traffic from Google. However, this doesn’t explain why multiple search engines were overcrawling even with a validated sitemap in use.

Most of the information online explains issues with deep folder hierarchies, dynamic URLs, and masses of pages, but there doesn’t seem to be an easy solution. We’re hoping UFDC2 serves as a solution for now. In the meantime, if anyone has recommendations for other options that have worked for search engine optimization of deep websites, and especially for digital libraries with millions of pages, please let me know (via comments or email).

Also in the meantime, search engines should start crawling UFDC2, and the static pages will finish building later today. We’re hoping this works!