Laurie N. Taylor July 28th, 2008
I’ve been so busy the past year (or 14 months to be completely accurate) since joining UF’s Digital Library Center that it’s hard to see what all we’ve accomplished. The time has flown by with loads of wonderful work, and wonderful progress. I decided to review some of our documentation and to note a few of the highlights:
- More stuff! We hit the 1 million page mark in September 2007, and as of today we’re at 2.12 million with so many more to load!
- More types of stuff! Improvements to UFDC include support for audio and video files and better multi-language support!
- Better ways to see the stuff! Optimized code for a faster UFDC, thumbnails for all new book images for faster quick-viewing, a better interface for usability!
- Better connections to find stuff! We optimized UFDC for search engines so we’re crawled properly, created RSS feeds for the collections within UFDC, and set up external accounts to share content and to connect users to UFDC (this blog, our Flickr account, our YouTube presence, and Wikipedia links for items and entries on authors, books, people, and places related to the collections, connecting context with actual items).
- More work to tell people about our stuff! Multiple presentations internally and at national and international conferences, interns, class tours, working with faculty, students, staff, and organizations to tell them about UFDC and to show them how it can help their work. We made exhibits, contributed digital materials to exhibits and other events and publications, and worked with the Libraries’ Public Information Officer to write and distribute press releases and other materials.
- More projects to keep going! Working with other groups at the UF Libraries for particular collections, including: Retrospective Dissertation Scanning; Marjorie Kinnan Rawlings Collection; Romanies Collection; Gainesville Bands; British Parliamentary Debates; Asia Collection; Women in Development; and many more, including further developing existing collections like the Florida Digital Newspaper Library and the Digital Library of the Caribbean (dLOC) with partners at the UF Libraries, at UF, and elsewhere. In addition to partner-based projects, we’ve also defined some projects chronologically, as grant-funded and time-based projects. Thus we’ve finished some grants, started new ones, applied for others, are preparing to apply for still others, and are migrating some of our older projects from other technology to UFDC.
All of this and much more happened during the past year, but the Digital Library Center has been around since 1999 so it all grows from that ongoing work. That’s still the more recent history because the Digital Library Center grew out of the Preservation Department (founded in 1987, I think, based on the “News from the Preservation Office” newsletters now online in UFDC). By 1993, the Preservation Department was already looking toward a comprehensive method for preservation, around the same time that the Mosaic browser was helping generate interest in the World Wide Web, heralding the promise of the digital revolution to come. There’s so much more to the history and the future of the Digital Library Center, but it’s too much to try to put in one blog post so it’ll have to wait for later.
Laurie N. Taylor June 24th, 2008
In our ongoing work to improve the findability of books in the UF Digital Collections (UFDC), we now have an RSS page with feeds for each of the collections. The RSS feed page is http://www.uflib.ufl.edu/ufdc2/rss/.
Please sign up for a feed or two to learn about the great materials added daily, and please share the RSS feeds with others!
Laurie N. Taylor June 18th, 2008
In addition to our UFDC search engine optimization, we’re working on RSS feeds for all new items and for new items from each of the collections. Our RSS feed page will be here: http://www.uflib.ufl.edu/ufdc2/rss/ but it’s still in development right now. RSS feeds take advantage of the power of the web to syndicate and share content and the methods search engines use for ranking content. While this has been arguably problematic as traditional media takes its time in changing, using RSS feeds makes sense, especially for sites like the University of Florida Digital Collections, where we want to share content as widely and completely as possible. Hopefully this post will soon be followed by others on the active RSS feed!
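For anyone curious what a feed looks like under the hood, here is a minimal sketch in Python that builds an RSS 2.0 document with the standard library. It is an illustration only, not the code behind our feeds: the function name `build_rss`, the collection title, and the sample item are all hypothetical.

```python
import xml.etree.ElementTree as ET


def build_rss(channel_title, channel_link, channel_description, items):
    """Return an RSS 2.0 document (as a string) for a list of item dicts.

    Each item dict is assumed to have "title" and "link" keys; this shape
    is a hypothetical choice for the sketch, not UFDC's real data model.
    """
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    # RSS 2.0 requires a title, link, and description on every channel.
    ET.SubElement(channel, "title").text = channel_title
    ET.SubElement(channel, "link").text = channel_link
    ET.SubElement(channel, "description").text = channel_description
    for item in items:
        node = ET.SubElement(channel, "item")
        ET.SubElement(node, "title").text = item["title"]
        ET.SubElement(node, "link").text = item["link"]
    return ET.tostring(rss, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical sample data, for illustration only.
    sample_items = [
        {"title": "Example newly added item", "link": "http://www.uflib.ufl.edu/ufdc/"},
    ]
    print(build_rss(
        "Example collection feed",
        "http://www.uflib.ufl.edu/ufdc2/rss/",
        "New items in an example collection",
        sample_items,
    ))
```

A feed reader or search engine that polls this document sees each newly added `item` without having to crawl the whole site, which is the sharing benefit described above.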
Laurie N. Taylor June 17th, 2008
Now that the internal code for the University of Florida Digital Collections is optimized, we’re starting to optimize for search engines. We currently use robots.txt to request that search engines not crawl our site. Doing so was a hard choice because we want our materials to be accessible and used. However, we were forced to stop the search engines because they were crashing our server. We had a number of overzealous search engines that crawled and re-crawled, and crawled in strange ways. With our JPEG 2000 images, the over-crawling and overly quick crawling ate too much memory, and we couldn’t allow it and remain functional. This overcrawling happened even with a site map and all of the proper webmaster configurations. Because the normal right way wasn’t working, we’ve chosen a secondary right way. We hope that this method works until we can make the normal right way work.
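For readers unfamiliar with robots.txt, here is a minimal sketch in Python of the kind of request described above. The two-line policy is an assumed example of a blanket "please do not crawl" rule, not a copy of our actual file, and the helper name `is_crawl_allowed` and the crawler name `ExampleBot` are hypothetical. It uses the standard library's `urllib.robotparser` to show how a well-behaved crawler would interpret the policy.

```python
from urllib.robotparser import RobotFileParser

# Assumed example policy: ask every crawler to stay away from the whole site.
# This illustrates the technique; it is not the real UFDC robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def is_crawl_allowed(robots_txt, user_agent, url):
    """Return True if a crawler honoring this policy may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # "ExampleBot" is a made-up crawler name used only for this sketch.
    print(is_crawl_allowed(ROBOTS_TXT, "ExampleBot", "http://www.uflib.ufl.edu/ufdc/"))
```

Keep in mind that robots.txt is only a request: it works because reputable crawlers choose to honor it, not because the server enforces it.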
We’re currently in the process of building a separate single page for every item in the collection, and we’ll rebuild these weekly until the normal search indexing works. These pages will live on www.uflib.ufl.edu/ufdc2 as opposed to our real site, www.uflib.ufl.edu/ufdc. These pages will have the basic information for each item, and the links will go over to the main site (UFDC). By allowing search engines to crawl and index the information on UFDC2, we hope that the search engines will include our information so that the site will be more findable without creating huge server memory drains.
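To make the idea concrete, here is a minimal sketch in Python of what generating one of these lightweight pages could look like. It is not our actual build code: the function name `render_item_page`, its parameters, and the sample values are hypothetical, and the link target is simply whatever main-site URL is passed in.

```python
from html import escape


def render_item_page(title, summary, main_url):
    """Build a small static HTML page that links to the item on the main site.

    The page carries only basic descriptive text, so serving it to a crawler
    is cheap compared with rendering the full item with its page images.
    """
    return (
        "<!DOCTYPE html>\n"
        "<html>\n"
        f"<head><title>{escape(title)}</title></head>\n"
        "<body>\n"
        f"<h1>{escape(title)}</h1>\n"
        f"<p>{escape(summary)}</p>\n"
        f'<p><a href="{escape(main_url, quote=True)}">View this item in UFDC</a></p>\n'
        "</body>\n"
        "</html>\n"
    )


if __name__ == "__main__":
    # Hypothetical sample record, for illustration only.
    page = render_item_page(
        "Example Item Title",
        "Basic information about the example item.",
        "http://www.uflib.ufl.edu/ufdc/",
    )
    print(page)
```

Because each page is plain HTML written out ahead of time, a crawler can fetch thousands of them without triggering any image processing on the main site.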
We’re not sure what the search engine problems were exactly, just that the engines (from multiple companies) were overcrawling. The University of Florida has an internal Google search appliance. Theoretically - and I haven’t read anything on this, but I would appreciate more information if anyone can help - Google’s main bots and UF’s instance could have simultaneously crawled, driving up their apparent traffic. However, this doesn’t explain why multiple search engines were overcrawling even with a validated sitemap in use.
Most of the information online explains issues with deep folder hierarchies, dynamic URLs, and masses of pages, but there doesn’t seem to be an easy solution. We’re hoping UFDC2 serves as a solution for now. In the meantime, if anyone has recommendations for other options that have worked for search engine optimization of deep websites, and especially for digital libraries with millions of pages, please let me know (via comments or email).
Also in the meantime, search engines should start crawling UFDC2, and the static pages will finish building later today. We’re hoping this works!