Search Engine Optimization
Laurie N. Taylor on Jun 17th 2008
Now that the University of Florida Digital Collections is optimized for internal coding, we’re trying to start optimizing for search engines. We currently use robots.txt to request that search engines not crawl our site. Doing so was a hard choice because we want our materials to be accessible and used. However, we were forced to stop the search engines because they were crashing our server. A number of overzealous search engines crawled and re-crawled, and crawled in strange ways. With our JPEG 2000 images, the excessive and overly rapid crawling consumed too much memory, and we couldn’t allow it and remain functional. This overcrawling happened even with a sitemap and all of the proper webmaster configurations. Because the normal right way wasn’t working, we’ve chosen a secondary right way. We hope that this method works until we can make the normal right way work.
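As a rough illustration of what a robots.txt request like this does, here is a minimal sketch in Python using the standard library's robots.txt parser. The file contents and the crawler name "ExampleBot" are assumptions made for the example, not our actual configuration; only the two site paths come from this post.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents (an assumption, not the real file).
# Rules match by path prefix, so the trailing slash in "/ufdc/" keeps the
# rule from also matching sibling paths that merely share the prefix.
ROBOTS_TXT = """\
User-agent: *
Disallow: /ufdc/
"""


def can_crawl(robots_txt: str, agent: str, url: str) -> bool:
    """Return True if a well-behaved crawler named `agent` may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    for url in (
        "http://www.uflib.ufl.edu/ufdc/",
        "http://www.uflib.ufl.edu/ufdc2/",
    ):
        print(url, can_crawl(ROBOTS_TXT, "ExampleBot", url))
```

Keep in mind that robots.txt is only a request: it affects crawlers that choose to honor it and does nothing to limit the request rate of those that do not.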
We’re currently building a separate single page for every item in the collection, and we’ll create these weekly until the normal search indexing works. These pages will live on www.uflib.ufl.edu/ufdc2 as opposed to our real site, www.uflib.ufl.edu/ufdc. Each page will have the basic information for its item, with links going over to the main site (UFDC). By allowing search engines to crawl and index the information on UFDC2, we hope that the search engines will include our information so that the site will be more findable without creating huge server memory drains.
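To make the idea concrete, here is a minimal sketch in Python of rendering one lightweight static page per item. The `Item` fields, the `?item=` link format, and the HTML layout are all hypothetical choices for illustration; they are not how UFDC2 is actually built.

```python
from dataclasses import dataclass
from html import escape
from urllib.parse import quote

# Main site address as given in this post.
MAIN_SITE = "http://www.uflib.ufl.edu/ufdc"


@dataclass
class Item:
    """Hypothetical minimal record for one item in the collection."""
    item_id: str
    title: str
    description: str


def render_item_page(item: Item) -> str:
    """Build a small static HTML page that links back to the main site."""
    # The "?item=" query form is an assumption, not the real UFDC URL scheme.
    link = f"{MAIN_SITE}/?item={quote(item.item_id)}"
    return (
        "<html><head>"
        f"<title>{escape(item.title)}</title>"
        "</head><body>"
        f"<h1>{escape(item.title)}</h1>"
        f"<p>{escape(item.description)}</p>"
        f'<p><a href="{escape(link, quote=True)}">View this item in UFDC</a></p>'
        "</body></html>"
    )


if __name__ == "__main__":
    sample = Item("EXAMPLE0001", "Sample Title & Subtitle", "Basic information only.")
    print(render_item_page(sample))
```

Because each page is plain text with no images or dynamic queries behind it, a crawler can fetch it cheaply, which is the point of keeping these pages separate from the main site.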
We’re not sure what the search engine problems were exactly, just that the engines (from multiple companies) were overcrawling. The University of Florida has an internal Google search appliance. Theoretically - and I haven’t read anything on this, but I would appreciate more information if anyone can help - Google’s main bots and UF’s instance could have crawled simultaneously, driving up the apparent traffic. However, this doesn’t explain why multiple search engines were overcrawling even with a validated sitemap in use.
Most of the information online explains issues with deep folder hierarchies, dynamic URLs, and masses of pages, but there doesn’t seem to be an easy solution. We’re hoping UFDC2 serves as a solution for now. In the meantime, if anyone has recommendations for other options that have worked for search engine optimization of deep websites, and especially for digital libraries with millions of pages, please let me know (via comments or email).
Also in the meantime, search engines should start crawling UFDC2, and the static pages will finish building later today. We’re hoping this works!
Filed in Digital Library, UFDC, access, datamining, deep web, digital collections, seo | No responses yet
100,000 pages a month
Laurie N. Taylor on Apr 20th 2008
The University of Florida Digital Collections are still relatively young, established separately only recently. Since March 23 of this year, we’ve added roughly another 100,000 pages: up from 1.62 million on March 23, we’re now at 1.718 million (and counting), and it’s only April 20. The full stats, as of today, are: 53,682 titles; 70,323 items; and 1,718,050 pages. Our statistics are dynamically updated and listed online here, broken down by collection.
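For anyone who wants to check the arithmetic behind the "100,000 pages a month" figure, here is a small sketch in Python. It assumes the rounded starting count of 1.62 million is close enough to use as 1,620,000, so the results are approximate.

```python
from datetime import date

# Figures from this post; the starting count is rounded ("1.62 million"),
# so everything derived from it is an approximation.
pages_start = 1_620_000
pages_now = 1_718_050
start_day = date(2008, 3, 23)
today = date(2008, 4, 20)

days = (today - start_day).days      # days elapsed between the two counts
added = pages_now - pages_start      # pages added over that span
per_day = added / days               # average daily growth
per_30_days = per_day * 30           # projected growth over a 30-day month

print(f"{added:,} pages in {days} days")
print(f"about {per_day:,.0f} pages per day")
print(f"about {per_30_days:,.0f} pages per 30 days")
```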
The statistics are a handy gauge of how our collections are developing, but they can’t reflect the quality of materials online. For a more complete sense of the materials online, new items are shown on a regular basis (daily to every few days) in the “browse new items” view, available here and dynamically updated with new items.
Filed in Collection Items, Digital Library, datamining, statistics | 2 responses so far
Codework: Opening Keynote Ted Nelson
Laurie N. Taylor on Apr 4th 2008
I’m currently at the Center for Literary Computing (CLC) workshop Codework: Exploring relations between creative writing practices and software engineering, sponsored by the National Science Foundation and held at West Virginia University (April 3-6, 2008; there’s more on it here). Ted Nelson, coiner of the word hypertext and media studies visionary, gave the opening keynote. Sandy Baldwin opened by introducing Nelson, describing Nelson as a luminary and having him speak as astronomical, and then describing how Nelson influenced his own English practice and work.
Nelson began by explaining his preference for open-ended speaking, and then introduced his new book-in-progress, “Geeks Bearing Gifts,” on the false rhetoric surrounding current software. He explained that current software and applications aren’t about technology, but are really packages of conventions selected by someone with an agenda, and mentioned OOXML as an example. He said that he’s been fascinated with making a document system rather than the fake paper simulators we have now, and he showed the latest version of the Xanadu Project (xanarama.net). Nelson’s reputation as a visionary and a great speaker is well earned, so well earned that I stopped taking notes after realizing that my notes would not do his presentation justice in the slightest. I believe the presentation was recorded, though, so once that’s posted I’ll add a link to it.
Filed in Academia, data sets, datamining, interface, visualization | No responses yet