In this post, I mentioned a number of large projects, one of which is the creation of an Armenian Library: an online repository of all Armenian books and written content. I want to spend a few minutes spelling out my vision for it and the steps in the project. This pertains to #4 in the above-mentioned post. Eventually I plan to move this information to the library website once it is up and running.
Mission: Preserve and promote Armenian culture and heritage by serving the entire body of Armenian written literature to the global community from a single well-designed website. Digitize and translate all written works in the Armenian language, and support others engaged in this endeavor through knowledge sharing, tools, cooperation, a centralized repository and other means, to accelerate the availability of Armenian content to Armenians and others interested in Armenian literature globally. Translate the entire Armenian literature into every language to make it accessible. Maintain a centralized database of all Armenian words translated into every language to serve the centralized dictionary and thesaurus. Preserve both the original and the textual representation of every work, enabling viewing and reading on every modern device and indexing by every search engine. Preserve all Armenian translations of non-Armenian literature. Respect and support the copyright of authors and promote the creation of new works in Armenian. (further improvement of the mission needed – comments welcome)
Vision: A global organization of all interested supporters, unified around the mission, developing, enhancing and populating a centralized database and Internet and other applications in order to serve all content through as many media as possible. Crowdsourced translations of all the books into every language, first produced by automated machine translation and then improved through the contributions of Armenians with foreign language skills and of others with Armenian proficiency. A server of all digitized Armenian books for all modern devices, in many book formats for easy reading. Content checked and improved over time through crowdsourcing. Holder of copyright grants from as many authors and their families as possible, but mostly a depository of works in the public domain. A large endowment for maintenance of the Armenian Library project in perpetuity. (again needs to be improved – comments welcome)
The basic steps are outlined below.
- Build a small proof of concept which takes 1 page from A to Z.
- Take a book from A to Z.
- Take a small personal library from A to Z.
During this process:
- Document the time it takes to perform all required steps.
- Develop the designs and diagrams for all equipment, software, databases, applications.
- After a small personal library is available online, build a prototype of the equipment for another site.
- Run from A to Z at another site by another person.
- Improve the prototype and application and designs in the process.
Repeat the above with a third site. After this, solicit participation from major Armenian libraries and other sources of literature. Accomplish the above in less than 5 years (oh how much I wish it could be less than 1 year, but that’s not possible without a small army of full-time staff).
The process of taking a book from A to Z is below:
- A. Enter material (book) metadata into a database (most likely MySQL) – initially through text entry, later through an application.
- B. Research the copyright situation.
- C. Scan the book using a book scanner from diybookscanner.org
- D. Process scanned images into a book (formats to be determined)
- E. Run OCR routines using Tesseract (needs to be trained for the various fonts of Armenian)
- F. Inspect and improve the OCR output (the perceived weakest link so far – need to find a great solution)
- G. Parse the book text into words.
- H. Populate dictionary database with all new words.
- I. Find literal translation of new words from online and offline sources.
- J. Populate translation database (to make it cheaper to translate future books, i.e. look a word up on Google once, for example, not once for every book).
- K. Machine translate the book into every language (realizing this will be a very poor translation but at least it’s a starting point).
- L. Convert book text into various output formats (Kindle, Apple iBooks, Nook, etc.)
- M. Post book online (based on copyright situation) for viewing and editing of the versions. (Here I envision the original scanned version shown alongside the extracted text, where the text output can be crowd-improved: find a wrong word based on the picture, correct it, and the result is stored as a new version.)
- N. Review all equipment, tools and software for improvement (no need to waste time every time, just improve as we go).
- X. Say Հայր մեր որ յերկինս ես…
- Y. Return the book to its place.
- Z. Go to step A for another book.
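To make steps G through J a little more concrete, here is a minimal Python sketch. It is only an illustration under my own assumptions: the dictionary is a plain in-memory `dict` standing in for the eventual database, the function names are hypothetical, and `lookup` is a placeholder for whatever online or offline translation source ends up being used. Words are matched using the Unicode ranges for Armenian capital and small letters.

```python
import re

# Armenian capital letters (U+0531-U+0556) and small letters (U+0561-U+0587).
ARMENIAN_WORD = re.compile(r"[\u0531-\u0556\u0561-\u0587]+")


def parse_words(text):
    """Step G: split book text into lowercase Armenian words."""
    return [word.lower() for word in ARMENIAN_WORD.findall(text)]


def add_new_words(dictionary, words):
    """Step H: add unseen words to the dictionary and return them in order."""
    new_words = []
    for word in words:
        if word not in dictionary:
            dictionary[word] = {}  # maps language code -> translation
            new_words.append(word)
    return new_words


def translate_new_words(dictionary, new_words, lang, lookup):
    """Steps I and J: look each new word up once and store the result.

    `lookup` is a hypothetical callable (word, lang) -> translation.
    """
    for word in new_words:
        if lang not in dictionary[word]:
            dictionary[word][lang] = lookup(word, lang)
```

The point of the design is the one made in step J: because `add_new_words` only returns words that were not already stored, a second book containing the same words triggers no further lookups.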
If you see anything missing or to be improved, please make a comment. This is not a simple project or a small project. It will require the time, talent and treasure of many to become a reality, but it is a very important project.
Some quick math: at 1 book per hour per person, it’ll take 9,600 years to finish the job, or about 10 years if we have 1,000 people working on this. These 1,000 people need to live somehow, which means it’ll take about $60 million just to scan the books. Of course, if I can somehow cut the time of scanning from 1 hour to 30 minutes, then it would take only 5 years to get there (from the day everything worked like clockwork). There’s also the ongoing maintenance and support of the facilities (in this case the web infrastructure), which can be estimated fairly easily, but initially at least it would be worth using a provider such as Amazon AWS or another cloud hosting facility.
Bottom line: without money and time this will never happen. The current fragmented approaches are not an option. I know that when we unite around our survival, we make incredible history, and I hope that this can snowball into a similar united movement to make the preservation and accessibility of Armenian literature a reality.
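For anyone who wants to play with the quick math above, here is a small Python sketch. It uses only the figures quoted in this post (9,600 person-years, 1,000 people, about 10 years, $60 million). The function name `calendar_years` is my own, and the cost per person per year is simply what those figures imply, not a number from a real budget.

```python
PERSON_YEARS = 9600           # from the post: one person at 1 book per hour
PEOPLE = 1000                 # from the post
TOTAL_COST_USD = 60_000_000   # from the post
ROUNDED_YEARS = 10            # the post rounds 9.6 years up to "about 10"


def calendar_years(person_years, people, speedup=1.0):
    """Years needed when the work is split evenly across `people`.

    `speedup` is how many times faster each book is scanned
    (2.0 means 30 minutes per book instead of 1 hour).
    """
    return person_years / (people * speedup)


years_at_one_hour = calendar_years(PERSON_YEARS, PEOPLE)
years_at_half_hour = calendar_years(PERSON_YEARS, PEOPLE, speedup=2.0)

# Implied by the post's figures, not stated in it.
implied_cost_per_person_year = TOTAL_COST_USD / (PEOPLE * ROUNDED_YEARS)

print(f"{years_at_one_hour:.1f} years at 1 hour per book")
print(f"{years_at_half_hour:.1f} years at 30 minutes per book")
print(f"${implied_cost_per_person_year:,.0f} per person per year")
```

Changing `PEOPLE` or `speedup` shows how quickly the timeline moves with more volunteers or a faster scanning workflow.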