richardmdavis

More on the rise and rise of academic blogs

Posted by: Richard M. Davis on: 9th July, 2009

Andy McGregor drew my attention to Michael Nielsen’s recent blog post (article?), “Is scientific publishing about to be disrupted?” Michael convincingly analyses the disruption of the news publishing industry by online news and blogging, and moves on in a similar way to consider scientific publishing. Michael reminds us that “more and more blogs contain high quality research content”.

If you’re reading this, I may be preaching to the converted, but in the interests of invoking authority and experience (like Chaucer’s Wife of Bath) we can add this to a growing number of assertions to this effect. As previously mentioned, Peter Murray-Rust’s views on the importance of blogging (and therefore of blog preservation) are worth repeating:

Blogs are evolving and being used for many valuable activities (here we highlight scholarship). Some bloggers spend hours or more on a post. Bill Hooker has an incredible set of statistics about the cost of Open Access and Toll Access publications, page charges, etc. Normally that would get published in a journal no-one reads (I have even published in such it was a huge effort and it’s got one citation. Not that I care about citations). So I tend to work out my half-baked ideas in public. Some people do their early science in the Open. Some are activists. Some review the current landscape, etc.

And in a similar vein, Heather Morrison, in her First Monday article “Rethinking collections – Libraries and librarians in an open age”, describes her experience:

Many of my most important contributions to the debates surrounding open access, for example, are posted to the Imaginary Journal of Poetic Economics, or to a listserv. These contributions may or may not be included in peer–reviewed literature at a later date.

If libraries focus solely on collecting peer–reviewed or formally published literature and not blogs and listservs, some of my best writings, and some of the ideas contained there and not expressed elsewhere, are likely to be lost.

I expect I’ll find many more opinions about this over the next few months, so this post will probably have a sequel. Back to Michael Nielsen, in the meantime, who also touches on the issue of collecting and preserving this valuable blog content:

It would be easy to build upon the open source WordPress platform [adding] important features [...] like reliable signing of posts, timestamping, human-readable URLs, and support for multiple post versions, with the ability to see (and cite) a full revision history. [...] Perhaps most importantly, blog posts could be made fully citable.

Encouraging words for this project. And WordPress-based plugin/theme solutions to many of Michael’s suggestions are already available in at least embryonic form – and they are GPL too. I’m looking forward to pulling some of them together into ArchivePress.

Our first month

Posted by: Richard M. Davis on: 8th July, 2009

June was ArchivePress Month 1, and already it’s hard to keep up, particularly with the online buzz. We’ve attracted a modicum of interest in the Twitosphere:

We’ve also had some highly useful discussions about the project on the JISC-PoWR blog and at Peter Murray-Rust’s blog. Among the things I’ve learned from them is that:

  • We have to continue to make our scope and use cases clear, particularly with regard to distinguishing our approach from crawling/spidering/harvesting. Creating local static copies of HTML renderings is the daddy of web archiving approaches, but our thesis is: TIMTOWTDI.
  • We’re not alone in thinking that blogs merit being treated differently from ‘traditional’ websites, and that this (setting a blog to catch a blog) might be a worthwhile idea/approach. But there are bridges to cross, notably the comments-harvesting.
  • Throughout academia – teaching, learning, research and administration – blogs are going from strength to strength. It would be a crime not to ensure they are preserved for future research.

July is the month when I get on with our first demonstrator, AP1, and record the process and review the results. And IWMW2009 and Enduring Web at BL. And the JISCRI Projects startup meeting I’m sitting in right now.

Which blogs should be preserved?

Posted by: Richard M. Davis on: 26th June, 2009

You’d think it obvious that my blog should be preserved, though I’m not so sure about yours! According to the poster summarising the fascinating 2007 survey by Carolyn Hank et al: “The majority of bloggers agreed (36%) or strongly agreed (34.9%) that their own blogs should be preserved.” Five per cent don’t want their blogs preserved at all; nearly a quarter aren’t fussed either way.

Here’s one of the data tables (which I had to retype as HTML – Peter Murray-Rust is right about PDFs and data):

Table 4. Preservation perceptions – general

                                          Strongly agree    Neither agree or    Strongly disagree
                                          or agree          (sic) disagree      or disagree
Should preserve      Personal blog        70.9%             23.8%               5.3%
                     Every blog           35.8%             27.9%               36.3%
                     Every comment        31.4%             31.9%               36.7%
                     All online content   28.2%             22.3%               49.5%
Should not preserve  Some blogs           44.7%             27.7%               27.7%
                     Some comments        48.4%             31.3%               20.2%
                     Some online content  51.3%             24.9%               23.8%

The overall pattern seems a good vindication of our own project approach, which will progressively move from capturing blog content (posts), to addressing comments and content, reflecting the scale of the bloggers’ own priorities.

It also seems a useful juncture in our project to throw open the question: which blogs should we preserve?

With over 5 million active blogs noted by Technorati, it seems daft to even start to enumerate them, but in our field (libraries, archives, information science), several stand out, and it’s the very nature and importance of these that bolster the case for keeping them. I have in mind in particular Peter Suber’s Open Access News blog, but also blogs such as those of Peter Murray-Rust, Brian Kelly, Lorcan Dempsey, Dorothea Salo, Jill Walker Rettberg – all ripe with contemporary accounts and robust views on matters of scholarly communication. But in every case, we have cause to wonder: will that information survive, will that link still work tomorrow?

What blogs (or types of blogs) do you think should be preserved, and why?

ArchivePress: When One Size Doesn’t Fit All

Posted by: Richard M. Davis on: 24th June, 2009

JISC-PoWR has discussed many times how best to preserve blogs for future use. No one should be in any doubt any more that there are rapidly growing corpora of blogs that contain valuable information or commentary – scholarly, actual, political, or personal – which merit keeping no less than famous and not-so-famous journals and diaries of the past.

Yet, as we discovered in JISC-PoWR, few institutions have truly incorporated web archiving into their overall records and asset-management systems, let alone recognised the specific value of blog content (or even of using blogging to replace traditional approaches to reporting and minuting). Perhaps it just seems too complicated. For those that want to, the only tools that seem to be readily available are specialised tools – like Web Curator Tool and PANDAS – that utilise crawlers like Heritrix and HTTrack to copy websites by harvesting the HTML framework, and following hyperlinks to gather further embedded or linked content. The result might typically be a bunch of ARC/WARC files (a file format specifically designed to encapsulate the results of web crawls), containing snapshots of the browser-oriented rendering of web resources. For many web resources, especially static pages, this is sufficient. When it comes to blogs, though, the archived results seem a bit too static – as I noted in an earlier JISC-PoWR post.

Treating blogs only as web pages overlooks the fact that they are derived from rich, dynamic data sources, and are usually databases themselves. An archive of blogs should allow us to do exactly the same kind of selection as on a live blog: selecting posts by author, date, category, tag. And since a blog is structured data, isn’t the underlying data a more appropriate target for long-term preservation, rather than endless, often duplicate copies of just one particular view of that data?
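To illustrate the kind of selection meant here, the following is a minimal sketch in Python, using SQLite from the standard library. The schema, table names and sample posts are all hypothetical and far simpler than any real blogging platform's database; the point is only that once posts are kept as structured data, filtering by author, date, category or tag becomes an ordinary query.

```python
import sqlite3

# Hypothetical, much-simplified schema (an assumption for illustration only).
SCHEMA = """
CREATE TABLE posts (
    id INTEGER PRIMARY KEY,
    author TEXT NOT NULL,
    published TEXT NOT NULL,   -- ISO 8601 date, so string comparison orders correctly
    title TEXT NOT NULL
);
CREATE TABLE post_terms (
    post_id INTEGER NOT NULL REFERENCES posts(id),
    kind TEXT NOT NULL,        -- 'category' or 'tag'
    name TEXT NOT NULL
);
"""


def build_archive():
    """Create an in-memory archive holding a few made-up posts."""
    db = sqlite3.connect(":memory:")
    db.executescript(SCHEMA)
    db.executemany(
        "INSERT INTO posts VALUES (?, ?, ?, ?)",
        [
            (1, "alice", "2009-06-19", "Project startup"),
            (2, "bob", "2009-06-24", "One size doesn't fit all"),
            (3, "alice", "2009-07-08", "Our first month"),
        ],
    )
    db.executemany(
        "INSERT INTO post_terms VALUES (?, ?, ?)",
        [
            (1, "category", "project"),
            (2, "tag", "preservation"),
            (3, "category", "project"),
            (3, "tag", "preservation"),
        ],
    )
    return db


def select_posts(db, author=None, since=None, category=None, tag=None):
    """Return titles of posts that match every filter supplied, oldest first."""
    sql = "SELECT p.title FROM posts p WHERE 1=1"
    params = []
    if author is not None:
        sql += " AND p.author = ?"
        params.append(author)
    if since is not None:
        sql += " AND p.published >= ?"
        params.append(since)
    for kind, name in (("category", category), ("tag", tag)):
        if name is not None:
            sql += (
                " AND EXISTS (SELECT 1 FROM post_terms t"
                " WHERE t.post_id = p.id AND t.kind = ? AND t.name = ?)"
            )
            params.extend([kind, name])
    sql += " ORDER BY p.published"
    return [row[0] for row in db.execute(sql, params)]


archive = build_archive()
print(select_posts(archive, author="alice"))
print(select_posts(archive, tag="preservation", since="2009-07-01"))
```

A static snapshot of rendered pages can only offer the views that happened to be crawled, whereas a query like this can be recombined freely after the fact.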

So what if, instead, the archiving tool were a bit of software already in use, or at least widely used, supported and understood? And Open Source, naturally.

This is the premise behind ArchivePress, a new JISC-funded project being undertaken by ULCC and the British Library. It is a ‘proof-of-concept’ project to progressively explore the implications and possibilities of using newsfeeds and blogging software – WordPress, of course – to capture and archive blog content dynamically, as it happens. ArchivePress will demonstrate the use of feed-reading aggregation to populate the database automatically with posts, comments and embedded content. The result will be a working model of a WordPress installation, with extra plugins, which can be easily set up by any institution to harvest content from blogs they have an interest in. We’ll continue our association with UKOLN, who, along with Lincoln University and the Digital Curation Centre, have agreed to let us use some of their blogs in our development and testing.
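The general idea of feed-reading aggregation can be sketched in a few lines of Python. This is not ArchivePress itself, which is built on WordPress and PHP: the feed below is a made-up RSS 2.0 document, and the `archived_posts` table and the `open_archive` and `harvest` functions are hypothetical names chosen for the illustration. A real harvester would fetch the feed over HTTP on a schedule.

```python
import sqlite3
import xml.etree.ElementTree as ET

# A tiny, made-up RSS 2.0 document standing in for a fetched newsfeed.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example blog</title>
    <item>
      <title>First post</title>
      <link>http://blog.example.org/first-post</link>
      <guid>http://blog.example.org/?p=1</guid>
      <pubDate>Fri, 19 Jun 2009 10:00:00 +0000</pubDate>
      <description>Hello, archive.</description>
    </item>
    <item>
      <title>Second post</title>
      <link>http://blog.example.org/second-post</link>
      <guid>http://blog.example.org/?p=2</guid>
      <pubDate>Wed, 24 Jun 2009 09:30:00 +0000</pubDate>
      <description>More content.</description>
    </item>
  </channel>
</rss>
"""


def open_archive():
    """Create an in-memory store; the table layout is an assumption."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE archived_posts ("
        " guid TEXT PRIMARY KEY, title TEXT, link TEXT,"
        " published TEXT, body TEXT)"
    )
    return db


def count_posts(db):
    return db.execute("SELECT COUNT(*) FROM archived_posts").fetchone()[0]


def harvest(db, feed_xml):
    """Store each feed item once, keyed on its guid; return how many were new."""
    before = count_posts(db)
    for item in ET.fromstring(feed_xml).iter("item"):
        row = (
            # Fall back to the link when an item carries no guid.
            item.findtext("guid") or item.findtext("link"),
            item.findtext("title"),
            item.findtext("link"),
            item.findtext("pubDate"),
            item.findtext("description"),
        )
        # INSERT OR IGNORE skips items already stored by an earlier poll.
        db.execute("INSERT OR IGNORE INTO archived_posts VALUES (?, ?, ?, ?, ?)", row)
    db.commit()
    return count_posts(db) - before


store = open_archive()
print(harvest(store, SAMPLE_FEED))  # first poll stores both items
print(harvest(store, SAMPLE_FEED))  # polling again adds nothing new
```

Keying on the item's guid is one simple way to make repeated polling safe; it would not by itself notice a post that was edited after it was first collected.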

In some respects there seems nothing terribly new to anyone already adept with blogs, newsfeeds and newsreaders – except that this appears to be the first attempt to exploit them to create accessible, managed collections of blog posts, with the potential to meet the more exacting requirements of archives and records management, such as reliability and authenticity. Even organisations that have a single mandated blog platform may wish to consider this approach to preserving their blog content. ArchivePress might also be of value to other blog-based activities, from local-history projects, to school and college blogs.

ArchivePress has its own website and blog, that will build a cumulative picture of its results and the issues it encounters over the next 6 months. It wouldn’t have been possible without JISC-PoWR, and we hope it will complement that work. Please check it out and add it to your feed reader. We have a great team who will be contributing, including Maureen Pennock (ex-UKOLN, now at British Library) and Ed Pinsent (UKWAC and JISC-PoWR) – and we even plan to squeeze some guest posts out of web preservation alumni. I’ll also be talking about ArchivePress at the Missing Links workshop at the British Library in July.

ArchivePress at the British Library

Posted by: Richard M. Davis on: 19th June, 2009

There will be a presentation about the ArchivePress project, its background and aims, as part of the forthcoming JISC, DPC and UK Web Archiving Consortium Workshop: Missing Links: the Enduring Web, July 21st at the British Library. For full information about the event see the Digital Preservation Coalition website.

Project startup meeting (15th June 2009)

Posted by: Richard M. Davis on: 19th June, 2009

We held the first ArchivePress team meeting at ULCC on Monday, to review the project plan and objectives.

The plan described in the Project Proposal still seems essentially reasonable and achievable. The project will have three main iterations, each dealing with a different corpus of blogs and with different technical and functional issues.

In Phase One (AP-1), we will simply use FeedWordPress to gather the content from the three blogs of the Digital Curation Centre. This will allow us to examine the results and flag issues for the next phase. Initial guidance on installing and configuring the software will be prepared.

Phases AP-2 and AP-3 will address, respectively, the issues of harvesting comments associated with blog posts, and gathering embedded objects (images, etc). Both Lincoln University and UKOLN have provisionally agreed that we can harvest their various blog outputs as part of this process.
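As a rough indication of where those two phases might start, the Python sketch below looks inside a feed item for the two things that would need following up: a per-post comment feed and any embedded images. It assumes the feed advertises comments with the `wfw:commentRss` element and carries the post body in `content:encoded`, which not every blog will do. The sample feed and the `follow_up_targets` function are invented for the illustration.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Namespaces of the RSS content module and the Well-Formed Web Comment API.
NS = {
    "content": "http://purl.org/rss/1.0/modules/content/",
    "wfw": "http://wellformedweb.org/CommentAPI/",
}

# Made-up feed with a single item, standing in for a real newsfeed.
SAMPLE_ITEM_FEED = """<rss version="2.0"
    xmlns:content="http://purl.org/rss/1.0/modules/content/"
    xmlns:wfw="http://wellformedweb.org/CommentAPI/">
  <channel>
    <item>
      <title>Open Repositories 2009</title>
      <wfw:commentRss>http://blog.example.org/or09/feed/</wfw:commentRss>
      <content:encoded><![CDATA[<p>Dinner was here:
        <img src="http://blog.example.org/files/aquarium.jpg" alt="Aquarium"></p>
        <p><img src="/files/poster.png"></p>]]></content:encoded>
    </item>
  </channel>
</rss>
"""


class ImageCollector(HTMLParser):
    """Collect the src attribute of every <img> tag in an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def follow_up_targets(feed_xml):
    """List, per item, the comment feed and image URLs still to be harvested."""
    targets = []
    for item in ET.fromstring(feed_xml).iter("item"):
        collector = ImageCollector()
        collector.feed(item.findtext("content:encoded", default="", namespaces=NS))
        targets.append(
            {
                "title": item.findtext("title"),
                "comment_feed": item.findtext("wfw:commentRss", namespaces=NS),
                "images": collector.sources,
            }
        )
    return targets


for target in follow_up_targets(SAMPLE_ITEM_FEED):
    print(target)
```

Relative image paths such as `/files/poster.png` would still need resolving against the blog's base URL before they could be fetched.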

The starting point of the AP approach is the hypothesis that collecting the content of the newsfeeds from blogs may be sufficient for many likely requirements of blog archiving. This doesn’t mean necessarily that it is a fool-proof or instant solution, and the intention of the project is to determine, through practical investigation, how effective this approach is, and what its strengths and limitations are.
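One practical check of that hypothesis is whether a given feed carries the whole post or only a teaser. The Python sketch below is a crude heuristic rather than a project method: it assumes that full text, where present, arrives in the `content:encoded` element, and it treats any item without one as summary-only. The sample feed and the `coverage_report` name are hypothetical.

```python
import xml.etree.ElementTree as ET

CONTENT_NS = {"content": "http://purl.org/rss/1.0/modules/content/"}

# Made-up feed: one item with full text, one with only a short description.
SAMPLE_MIXED_FEED = """<rss version="2.0"
    xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <item>
      <title>Full post</title>
      <description>Short teaser</description>
      <content:encoded><![CDATA[<p>The complete text of the post.</p>]]></content:encoded>
    </item>
    <item>
      <title>Teaser only</title>
      <description>Short teaser</description>
    </item>
  </channel>
</rss>
"""


def coverage_report(feed_xml):
    """Map each item title to 'full' or 'summary-only' (heuristic, see above)."""
    report = {}
    for item in ET.fromstring(feed_xml).iter("item"):
        body = item.findtext("content:encoded", default="", namespaces=CONTENT_NS)
        report[item.findtext("title")] = "full" if body.strip() else "summary-only"
    return report


print(coverage_report(SAMPLE_MIXED_FEED))
```

A blog whose feed comes back mostly summary-only would be a case where feed harvesting alone falls short and some other capture method is needed.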

We have a number of dissemination opportunities available. I am already confirmed as a speaker at The Enduring Web (BL, Tuesday 21st July), and at UKOLN’s IWMW 2009 (University of Essex, Tuesday 28th July). Other imminent opportunities include IWAW 2009 (at ECDL in Corfu, 30th Sept – 1st Oct), iPRES 2009 (San Francisco, Oct 5th – 6th) and IIPC at iPRES 2009 (Oct 7th). A proposal has already been submitted to iPRES. In addition, ULCC hopes to launch its AIDA digital preservation toolkit shortly, with a programme of DP events, and there may be scope to represent AP there.

This blog will be the central source of information about the project, and a place to publish our findings, and discuss what we are doing, how, and why. We hope to follow the successful model of the JISC-PoWR blog and encourage discussion from colleagues in the field, and maybe even some guest posts from eminent digital preservationists.

Maureen is going to focus her attention on user requirements and expectations, including legal, ethical and ownership issues, and the possible use cases, such as academic institutions, thematic collections, or local history projects: she will discuss her thoughts in another post. Ed will assess the relative merits of the AP approach and the web crawler approach of other web archiving endeavours, from the perspectives of both records-management and usability.

I will manage the WordPress configuration and customisation in the early phases, and expect to call on Rory’s help and advice for any advanced PHP development requirements. An environment is being set up on Google Code to support development work in due course.

My next tasks are to prepare the formal plan for our JISC Programme Manager, James Farnhill; and start configuring a WordPress installation for AP-1 – more on that in due course.

Open Repositories 2009

Posted by: Richard M. Davis on: 10th June, 2009

Georgia Aquarium by Driek Heesakkers on Flickr (CC:by-nc-sa)

Less than three weeks have passed since I found myself at Open Repositories 2009 (#OR09) in Atlanta, and it already seems a long time ago. For the record, Georgia Tech put on an excellent show, overflowing with fascinating presentations, people and ideas – far too many to take in – and (most importantly) an excellent and entertaining dinner at the Georgia Aquarium.

I took a smashing poster describing our work on Linnean Online and the SNEEP extensions for EPrints, and also spoke about these projects to the EPrints User Group sessions and had to endure the now inevitable Minute Madness. I was pleased to spot the SNEEP Comments plugin in use when Jessie Hey demonstrated EdShare, another of Southampton’s learning resource repository projects. It was also great to meet up again with Patrick McSweeney who has been tweaking SNEEP at Southampton, and discuss ways of keeping ongoing work on the plugins in sync. Regular readers may remember Patrick from OR08, and he cut an even more unforgettable figure this time.

The talk of the event seemed to be the relentless buzz around the unification of DSpace/Fedora Commons, engendering the new creation that is DuraSpace (and DuraCloud). This offers a lot of exciting possibilities that we’ll need to keep track of, though it won’t be the first repositories event that has offered us a surfeit of jam tomorrow… For now, for the curious, here’s the Duraspace FAQ.

By contrast, it’s slightly disappointing that, over the water, the EPrints user group seemed a tad under-subscribed. Features available in EPrints 3.1.x, and those imminent for 3.2, from cloud storage controllers and desktop folder visualisations to preservation support, promise quick wins for anyone wanting to push the repository model further: Les and the EPrints team waste no time in responding to the latest demands of the zeitgeist. All the same, informal discussions with users and non-users of EPrints suggested substantial resistance to its Perl-based core. Yet EPrints continues to push more configurability away from its Perl source: in the kind of repository-driven future oft foretold – from WordPress-type extensibility to modular service-oriented solutions – the underlying code base ought to become increasingly irrelevant as long as the package does what it says on the tin.

As usual it was great to meet some old friends, and lots of people for the first time. Memorably serendipitous (re-)discoveries included:

  • Bibapp – “a Campus Research Gateway and Expert Finder”. There have been many attempts to integrate personalised, portfolio pages with repositories, and this looks like an effort worth investigating further, particularly as it claims to be repository neutral (and a good excuse to try out Ruby for real?).
  • ParallelArchive – another variant on the repository model: “a personal scholarly workspace, a collaborative research environment, and a digital repository”. Run by Open Society Archives (OSA) at Central European University in Budapest – of particular interest to students of cold war and related issues
  • E-Lis – still a superb multilingual collection of LIS resources, and undoubtedly the acid test of all EPrints internationalisation efforts
  • MIT Open CourseWare – the mother of all OERs?
  • The great Peter Sefton – great to meet him at last, at 6′ 7″, someone I can truly look up to. For a much more thorough account of the conference, see Pete’s Blog

I didn’t manage anything in the way of sightseeing, though the Aquarium seemed to be top of most locals’ list of recommendations, and we went there. Perhaps I should have made more of an effort to see the Civil War museum. For the visual record of OR09, content and context, you might like to see Jim Downing’s photos from the event, and the official OR09 photo set on Flickr.

What is the Library of the Future?

Posted by: Richard M. Davis on: 10th April, 2009

The New Biblioteca Alexandrina by Julian Pierre on Flickr (CC:by-nc)

Last Thursday’s Libraries Of The Future (LOTF) event at Oxford University has been well covered elsewhere, so I’ll just note a few key themes as I inferred them. LOTF is a JISC-sponsored campaign begun last year, and continued by means of online social networking (chez Ning) and a JISCInvolve blog, as well as F2F events like this.

If you’d nodded off for even only a couple of years, LOTF might well seem like the product of some post-information-meltdown future, with live Twitter streams projected behind the speakers, and the occasional glimpse of the parallel Second Life auditorium. All the same, one thing we didn’t see, in Real Life or Second Life, were shelves of books, or any books at all for that matter, apart from the JISC Collections catalogue on the registration desk. (By the way, if anyone knows of any shelf-for-shelf library reconstructions iSL, please send me a link.)

On change: Sarah Thomas astutely observed that change only looks fast if you are standing still: the only way to manage it is to be part of it. She proffered sound advice that the past and future – old ways and new ways – should not be set in opposition: we’ll achieve best results by understanding and integrating the best of traditional and innovative approaches to information management. Chris Batt compared the LOTF meeting with a 15th Century gathering of the Society of Scribes, discussing the emergence of “something big”. On that occasion, the “smart” ones, as Chris observed, bit the bullet and set about designing typefaces or investing in their own presses, leaving the rest to soldier on in an ever more marginalised business.

Set a blog to catch a blog…

Posted by: Richard M. Davis on: 23rd March, 2009

Originally published on the JISC-PoWR blog.

Much discussion of blog preservation focuses on how to preserve the blogness of blogs: how can we make a web archive store, manage and deliver preserved blogs in a way that is faithful to the original?

Since it is blogging applications that provide this structure and behaviour (usually from simple database tables of Posts, Comments, Users, etc), perhaps we should consider making blogging software behave more like an archive. How difficult would that be? Do we need to hire a developer?

One interesting thing about WordPress is the number of uses its simple blog model has been put to. Under the hood, it is based on a remarkably simple database schema of about 10 tables and a suite of PHP scripts, functions and libraries that provide the interface to that data. Its huge user-base has contributed a wide variety of themes and additional functions. It can be turned into a Twitter-like microblog (P2 and Prologue) or a fully-fledged social network (WordPress MU, Buddypress).

Another possibility exploited by a 3rd-party plugin is that of using WordPress as an aggregating blog, collecting posts automatically via RSS from other blogs: this seems like a promising basis for starting to develop an archive of blogs, in a blog.

If you can keep your blog when all around…

Posted by: Richard M. Davis on: 20th March, 2009

I was a keen participant in the activities of ERPANET, but I must confess I haven’t kept abreast of its successor, Digital Preservation Europe (DPE). However I was interested to see the recent DPE briefing paper about blog preservation, since it covers an area that we also tackled in the course of the JISC-PoWR project – on the blog, in the workshops and the handbook. The Briefing Paper highlights key issues for those who would preserve blogs. It is a necessarily general overview, and manages to cram a lot of preservation issues into its two sides of A4. But, for the blogger approaching preservation, or the preservationist approaching blogs, I wonder if such avalanches of considerations aren’t sometimes unnecessarily overwhelming. It seemed worth looking at a few of the points made in the DPE briefing paper, and considering whether we can demystify them or make the task seem less daunting.

