Archive for the ‘datamining’ Category
Announcement: DataCite Summer Meeting – Data and the Scholarly Record: the Changing Landscape
DataCite will hold its second Summer Meeting on August 24th and 25th at the historic Shattuck Plaza Hotel in Berkeley, California. The Summer Meeting will be a 1.5-day event, and you can register at http://datacite2011.eventbrite.com/.
The Summer Meeting brings together people from research organisations, data centers, government, and information service providers to hear about the latest developments in data science, data citation, discovery, and reuse. It also provides opportunities to exchange experience and influence the next generation of data citation services.
This year’s program will include sessions on data citation, data publishing, and discussions on the new challenges that come with increased access to scientific data.
The 2010 DataCite summer meeting brought together a strong programme of speakers and participants (http://www.datacite.org/datacite_summer_meeting_2010). Highlights were published in D-Lib (http://dx.doi.org/doi:10.1045/january2011-contents).
DataCite helps researchers find, access, and reuse data. It is an international not-for-profit association founded in 2009 with members across the globe.
CFP: DDI Workshop: Managing Metadata for Longitudinal Data – Best Practices
DDI Workshop: Managing Metadata for Longitudinal Data – Best Practices
September 19-23, 2011
Leibniz Center for Informatics, Schloss Dagstuhl, Wadern, Germany
Goals
This symposium-style workshop will bring together representatives from major longitudinal data collection efforts to share expertise and to explore the use of the DDI metadata standard as a means of managing and structuring longitudinal study documentation. Participants will work collaboratively to create best practices for documenting longitudinal data in its various forms, including panel data and repeated cross-sections.
Description of the workshop
Longitudinal survey data carry special challenges related to documenting and managing data over time, over geography, and across multiple languages. This complexity is often a barrier to building efficient systems for data access and analysis. DDI (Data Documentation Initiative) Lifecycle, a metadata standard that addresses the full life cycle of social science research data (formerly referred to as DDI 3), is designed to provide an efficient structure for the documentation of complex longitudinal data. In this workshop, participants involved in longitudinal data projects around the world will work together on issues involved in documenting longitudinal data.
Intended audience: Individuals with expertise in longitudinal social science data; knowledge of DDI is desired but not required. The intent is to have a mix of participants with substantive and technical skills. Participants should provide access to materials describing their projects, which can serve as use cases in applying DDI. The workshop is in English. This is the second Dagstuhl workshop on the topic; the first took place in October 2010. The upcoming workshop will continue the in-depth discussion begun last year, expanding into additional topics.
Expected Results
Participants will write best practice papers, to be published in the DDI Working Paper Series. Last year’s workshop produced a series of best practice papers on longitudinal data.
Possible Topics
Documenting comparison, harmonization, and the relationship among concepts, questions, and variables over time, as well as the relationship of respondent types (person, household) are typical issues for longitudinal data. Other topics not specific to longitudinal data:
- Classifications (e.g., ISCO, ISCED)
- Data collection details
- Qualitative data, other types of data sources beyond surveys
- Quality of metadata and data
- Data management planning
- Relationship to the Open Archival Information System (OAIS)
- Extension of DDI for specific needs
These topics are often more salient for longitudinal data, making it even more critical to manage these metadata in a structured form across time and countries. The current possibilities of DDI Lifecycle will be explored and areas for future extensions identified. Additionally, participants can suggest their own areas of interest.
Venue
The workshop will take place at the Leibniz Center for Informatics, Schloss Dagstuhl, Wadern, Germany. The non-profit center is a member of the Leibniz Association and is funded jointly by the German federal government and a number of state governments. The venue provides an intense working atmosphere in a pleasant, remote region. Several seminar rooms and a cafeteria during the day, and leisure rooms such as a wine bar and billiard room in the evening, promote intense discussion and communication. Accommodation at Dagstuhl, including full board, costs 60 euros per person per day (subsidized rate).
Sponsors
This workshop is sponsored by the DDI Alliance, GESIS – Leibniz Institute for the Social Sciences, Minnesota Population Center (MPC), and Open Data Foundation (ODaF).
Contact
The names of interested organizations and individuals should be sent to ddi-expert-workshop@icpsr.umich.edu. Please provide contact information, area of interest, and area of expertise for each individual, information regarding DDI Lifecycle implementation, and a statement of what each individual can contribute to the workshop. Direct questions to ddi-expert-workshop@icpsr.umich.edu. Twenty-one participants will be accepted.
Links
Related Web page: http://www.dagstuhl.de/11382
Best practice papers on longitudinal data: http://www.ddialliance.org/resources/publications/working/BestPractices/LongitudinalData
DDI Working Paper Series: http://www.ddialliance.org/resources/publications/working
Further information on “How to get to Dagstuhl”: http://www.dagstuhl.de/en/about-dagstuhl/arrival/
Pictures of Dagstuhl: http://www.dagstuhl.de/en/about-dagstuhl/press/downloads/
DDI Alliance: http://www.ddialliance.org/
GESIS – Leibniz Institute for the Social Sciences: http://www.gesis.org/
Minnesota Population Center (MPC): http://www.pop.umn.edu/
Open Data Foundation (ODaF): http://www.opendatafoundation.org/
The organizers would appreciate hearing soon from interested people.
Mary Vardigan, Director DDI Alliance
Wendy Thomas, Chair DDI Technical Implementation Committee
Joachim Wackerow, Vice Chair DDI Technical Implementation Committee
Arofan Gregory, Technical Consultant
(Organizers)
GESIS – Leibniz Institute for the Social Sciences
Department: Monitoring Society and Social Change
Unit: Social Science Metadata Standards
Visiting address: B2 1, 68159 Mannheim, Germany
Postal address: P.O. Box 122155, 68072 Mannheim, Germany
Phone: +49 (0)621 1246 262
Fax: +49 (0)621 1246 100
E-mail: joachim.wackerow@gesis.org
www.gesis.org/en/institute/
Data Documentation Initiative 3 (DDI 3) Data Extraction Tools from Colectica Awarded an NIH Grant
The Data Documentation Initiative 3 (DDI 3) standard is a simply fabulous and full standard for metadata (data about data) as well as for the data contents, making it a full payload standard.
DDI 3 is such an exciting standard because it allows for the possibility of true and full computational support for data harmonization and for really working with longitudinal data. It’s the type of data standard I’d been waiting for because it gets it. Data standards need to be able to support documenting, containing, expressing, and computing (analysis, harmonization, limitations on disclosure, everything we now do with less than ideal systems and methods). DDI 3 does this and that’s why groups like ICPSR are already using it. DDI 3 is already on its way to becoming ubiquitous, but more tools for it are needed.
News of others using and supporting DDI 3 is always good. Thus, it’s wonderful news that Colectica has been awarded an NIH Grant for DDI 3-based data extraction tools. From the Colectica website:
The award is a Phase I grant that provides supplemental support of Algenta’s research on an “Open Standards-Based Data Extraction Web Tool for Complex Longitudinal Datasets”. This Phase I feasibility study aims to analyze the data preparation and metadata creation workflow needed to prepare a study for online data extraction, to validate the use of the Data Documentation Initiative’s DDI 3 standard for the basis of such a tool, and to create prototype web-based data extraction software. While the focus is on longitudinal surveys, the proposed system would also handle cross-sectional, time-series, and non-repeated studies. The aim is to improve research methodologies through a simplification of the process used for discovering, retrieving, and analyzing data relevant to a researcher’s investigation and to improve data citations, aiding in reproducible research. The research includes consultation with researchers from ICPSR at the University of Michigan-Ann Arbor and the Mid-Life in the United States Longitudinal Study at the University of Wisconsin-Madison.
Historical Data Sets
“Coding Early Naturalists’ Accounts into Long-Term Fish Community Changes in the Adriatic Sea (1800–2000)” is a new article in PLoS One by researchers who mined historical data from materials found in archives to understand “fish communities’ changes over the past centuries” which “has important implications for conservation policy and marine resource management” (Abstract). The article explains the difficulty in integrating qualitative and anecdotal historical data with modern data, and their methodology for coding historical data. This article is a great example of how historical materials from archives support all research areas, including current and future scientific research. With so many archives and cultural heritage institutions digitizing their historical materials, more materials are readily accessible online and allow for mining and coding of historical data.
Search Engine Optimization
Now that the University of Florida Digital Collections is optimized for internal coding, we’re trying to start optimizing for search engines. We currently use robots.txt to request that search engines not crawl our site. Doing so was a hard choice because we want our materials to be accessible and used. However, we had to block the search engines because they were crashing our server. A number of overzealous search engines crawled and re-crawled the site, and crawled in strange ways. With our JPEG 2000 images, the excessive and overly rapid crawling consumed too much memory for the server to stay functional. This overcrawling happened even with a site map and all of the proper webmaster configurations. Because the normal right way wasn’t working, we’ve chosen a secondary right way. We hope that this method works until we can make the normal right way work.
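As a rough illustration of what a blanket robots.txt request looks like from a crawler's side, here is a minimal sketch using Python's standard urllib.robotparser module. The robots.txt content and the "ExampleBot" user agent are assumptions made for the example; our actual file is not reproduced here. Note that robots.txt is only a request, and it is up to each crawler to honor it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that asks every crawler to stay out of the whole site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def is_crawl_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # "ExampleBot" is a made-up crawler name used only for illustration.
    print(is_crawl_allowed(ROBOTS_TXT, "ExampleBot", "http://www.uflib.ufl.edu/ufdc/"))
```

A well-behaved crawler that reads a file like this one should skip every URL on the site, which is the effect we are relying on for now.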
We’re currently in the process of building a separate single page for every item in the collection, and we’ll rebuild these weekly until the normal search indexing works. These pages will live on www.uflib.ufl.edu/ufdc2 as opposed to our real site, www.uflib.ufl.edu/ufdc. Each page will have the basic information for its item, and its links will go over to the main site (UFDC). By allowing search engines to crawl and index the information on UFDC2, we hope that the search engines will include our information so that the site will be more findable without creating huge server memory drains.
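The idea of one lightweight static page per item can be sketched as follows. This is not our production code: the Item fields, the item identifier, and the "?item=" URL pattern are all hypothetical, chosen only to show a plain HTML page that carries basic information and links back to the main site.

```python
from dataclasses import dataclass
from html import escape
from urllib.parse import quote

# Main site address as given in the post; the item URL pattern below is assumed.
MAIN_SITE = "http://www.uflib.ufl.edu/ufdc"


@dataclass
class Item:
    item_id: str       # hypothetical identifier format
    title: str
    description: str


def render_item_page(item: Item) -> str:
    """Build a small static HTML page that links to the item on the main site."""
    # Hypothetical URL pattern; the real UFDC item URLs may look different.
    target = f"{MAIN_SITE}/?item={quote(item.item_id)}"
    return (
        "<!DOCTYPE html>\n"
        f"<html><head><title>{escape(item.title)}</title></head>\n"
        "<body>\n"
        f"<h1>{escape(item.title)}</h1>\n"
        f"<p>{escape(item.description)}</p>\n"
        f'<p><a href="{escape(target)}">View this item in UFDC</a></p>\n'
        "</body></html>\n"
    )


if __name__ == "__main__":
    sample = Item("UF0001", "Fish & Ships", "A sample record for illustration.")
    print(render_item_page(sample))
```

Because pages like this are plain files with no images or dynamic queries behind them, crawlers can fetch them repeatedly at very little cost to the server.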
We’re not sure what the search engine problems were exactly, just that the engines (from multiple companies) were overcrawling. The University of Florida has an internal Google search appliance. Theoretically – and I haven’t read anything on this, but I would appreciate more information if anyone can help – Google’s main bots and UF’s instance could have simultaneously crawled, driving up their apparent traffic. However, this doesn’t explain why multiple search engines were overcrawling even with a validated sitemap in use.
Most of the information online explains issues with deep folder hierarchies, dynamic URLs, and masses of pages, but there doesn’t seem to be an easy solution. We’re hoping UFDC2 serves as a solution for now. In the meantime, if anyone has recommendations for other options that have worked for search engine optimization of deep websites, and especially for digital libraries with millions of pages, please let me know (via comments or email).
Also in the meantime, search engines should start crawling UFDC2, and the static pages will finish building later today. We’re hoping this works!
100,000 pages a month
The University of Florida Digital Collections are still relatively young, established separately only recently. Since March 23 of this year, we’ve added roughly another 100,000 pages: we were at 1.62 million on March 23 and are now at 1.718 million (and counting), and it’s only April 20. The full stats, as of today, are: 53,682 titles; 70,323 items; and 1,718,050 pages. Our statistics are dynamically updated, listed online here, and broken down by collection.
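For anyone who wants to check the arithmetic, here is a small sketch of the growth calculation. It treats the March 23 figure as exactly 1,620,000 pages (the post only gives it as 1.62 million) and assumes the year is 2008, which does not change the number of days between the two dates.

```python
from datetime import date


def growth(start_pages: int, end_pages: int, start_day: date, end_day: date):
    """Return (pages added, days elapsed, average pages added per day)."""
    days = (end_day - start_day).days
    added = end_pages - start_pages
    return added, days, added / days


if __name__ == "__main__":
    # 1_620_000 is a rounded starting figure; the year 2008 is an assumption.
    added, days, per_day = growth(
        1_620_000, 1_718_050, date(2008, 3, 23), date(2008, 4, 20)
    )
    print(f"{added} pages in {days} days, about {per_day:.0f} pages per day")
```

With those inputs the increase is a little under 100,000 pages in four weeks.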
The statistics are a handy gauge of how our collections are developing, but they can’t reflect the quality of materials online. To give a more complete sense of the materials online, new items are shown on a regular basis (daily to every few days) on the “browse new items” view, available here and dynamically updated with new items.
Codework : Opening Keynote Ted Nelson
I’m currently at the Center for Literary Computing (CLC) workshop “Codework: Exploring relations between creative writing practices and software engineering,” sponsored by the National Science Foundation and held at West Virginia University, April 3-6, 2008 (there’s more on it here). Ted Nelson, coiner of the word hypertext and media studies visionary, gave the opening keynote. Sandy Baldwin opened by introducing Nelson, describing Nelson as a luminary and having him speak as astronomical, and then describing how Nelson influenced his own English practice and work.
Nelson began by explaining his preference for open-ended speaking, and then introduced his new book-in-progress, “Geeks Bearing Gifts,” on the false rhetoric surrounding current software. He explained that current software and applications aren’t about technology, but are really packages of conventions selected by someone with an agenda, and he mentioned OOXML as an example. He said that he’s been fascinated with making a document system rather than the fake paper simulators we have now, and he showed the latest version of the Xanadu Project (xanarama.net). Nelson’s reputation as a visionary and a great speaker is well earned, so well earned that I stopped taking notes after realizing that my notes would not do his presentation justice in the slightest. I believe the presentation was recorded, though, so once that’s posted I’ll add a link to it.