WARC Collection Summarization


Post by aminaas1576 »

In this final session of the Internet Archive’s digital humanities expo, Library as Laboratory, attendees heard from scholars in a series of short presentations about their research and how they’re using collections and infrastructure from the Internet Archive for their work.
Speakers:

Forgotten Histories of the Mid-Century Coding Bootcamp, [watch] Kate Miltner (University of Edinburgh)
Japan As They Saw It, [watch] Tom Gally (University of Tokyo)
The Bibliography of Life, [watch] Rod Page (University of Glasgow)
Q&A 1 [watch]
More Than Words: Fed Chairs’ Communication During Congressional Testimonies, [watch] Michelle Alexopoulos (University of Toronto)
WARC Collection Summarization, [watch] Sawood Alam (Internet Archive)
Automatic scanning with an Internet Archive TT scanner, [watch] Art Rhyno (University of Windsor)
Q&A 2 [watch]
Automated Hashtag Hierarchy Generation Using Community Detection and the Shannon Diversity Index, [watch] Spencer Torene (Thomson Reuters Special Services, LLC)
My Internet Archive Enabled Journey As A Digital Humanities Citizen Scientist, [watch] Jim Salmons
Web and cities: (early internet) geographies through the lenses of the Internet Archive, [watch] Emmanouil Tranos (University of Bristol)
Forgotten Novels of the 19th Century, [watch] Tom Gally (University of Tokyo)
Q&A 3 [watch]
Links shared during the session are available in the series Resource Guide.

Sawood Alam (Internet Archive)

Items in the Internet Archive’s Petabox collections of various media types, such as images, video, audio, and books, have rich metadata, representative thumbnails, and interactive hero elements. Web collections, which primarily contain WARC files and their corresponding CDX files, often look opaque by comparison. We created an open-source CLI tool called “CDX Summary” [1] to process sorted CDX files and generate reports.
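
To give a feel for what summarizing a CDX file involves, here is a minimal Python sketch. It is not the CDX Summary tool or its API. It assumes the common space-separated CDX layout in which the first five fields are URL key, 14-digit timestamp, original URL, MIME type, and HTTP status. The function name summarize_cdx and the sample records are made up for illustration.

```python
from collections import Counter
from urllib.parse import urlsplit


def summarize_cdx(lines):
    """Tally captures by MIME type, HTTP status, host, and year.

    Assumes each record is space-separated and begins with:
    urlkey, timestamp (YYYYMMDDhhmmss), original URL, MIME type, status.
    """
    mimes, statuses, hosts, years = Counter(), Counter(), Counter(), Counter()
    total = 0
    for line in lines:
        line = line.strip()
        # Skip blank lines and the optional "CDX N b a m s ..." header.
        if not line or line.startswith("CDX "):
            continue
        fields = line.split()
        if len(fields) < 5:
            continue  # malformed record; ignore it in this sketch
        _urlkey, timestamp, original, mime, status = fields[:5]
        total += 1
        mimes[mime] += 1
        statuses[status] += 1
        hosts[urlsplit(original).hostname or "unknown"] += 1
        years[timestamp[:4]] += 1
    return {
        "total": total,
        "mime": mimes,
        "status": statuses,
        "host": hosts,
        "year": years,
    }


# Hypothetical sample records, invented for this example.
SAMPLE = """\
 CDX N b a m s k r M S V g
com,example)/ 20200101000000 http://example.com/ text/html 200 AAAA - - 1234 0 a.warc.gz
com,example)/logo.png 20200102000000 http://example.com/logo.png image/png 200 BBBB - - 567 1234 a.warc.gz
org,example)/old 20210301000000 http://example.org/old text/html 404 CCCC - - 321 1801 a.warc.gz
"""

if __name__ == "__main__":
    summary = summarize_cdx(SAMPLE.splitlines())
    print("captures:", summary["total"])
    for name in ("mime", "status", "host", "year"):
        print(name, summary[name].most_common(3))
```

Because the function reads one line at a time and keeps only counters, it can be fed an open file object directly, so a large CDX file never has to be loaded into memory at once.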