A Guide to Digitalisation: Going Paperless with AI

Going paperless with AI. Discover how to combine Python and Azure cognitive services. A guide to digitalisation.

The paperless office has been predicted since at least the dawn of desktop computers in the 1970s. Now, with sustainability increasingly a business priority, dreams of the office of the future may finally be realised. With AI technology and cloud infrastructure like Azure, going paperless is more realistic than ever. Concerns over data availability and security can be allayed, digitization techniques are much more reliable, and information management has reached a level of sophistication that leaves paper sorting in the dust.

For this guide, we’ve teamed up with cyber services experts DGS to learn more about their own digitization processes using Python and Azure. We’ll look at the benefits and run through some vital digitization best practices.

Key Benefits to Adopting a Paperless Strategy

First let’s look at some of the key advantages of going paperless.

Sustainability

Perhaps the most obvious sustainability problem with paper is the resource use in its manufacture. Most paper is made from trees, and deforestation is reckoned to account for around 10% of global warming. Trees can be replanted of course, but the effects take time to normalise. Even recycled paper has significant environmental costs in the production process. Going paperless avoids these paper-related costs.

Time saving

The volume of data in modern business means paper documents are just not practical anymore. Even the most nimble administrator cannot compete with the speed and reach of digital search techniques. By combining comprehensive metadata with AI-powered search, digital resources can be found near-instantaneously, even with limited information.

Less error and data loss

Physical document processing is always risky. Papers can easily be mislaid and duplicates are not always available. Handwritten notes are especially subject to mistakes as well as misreading. However, digital resources can be readily backed up to numerous destinations and intelligent validity checks can quickly flag up human errors.

Security

Businesses are rightly concerned about digital security. But it’s too easy to forget how vulnerable physical resources are to breaches of security. Documents that are mislaid or inappropriately taken from secure locations risk revealing sensitive data. However, modern data security standards can now provide excellent protection against malicious access attempts.

Step by Step: Digitizing a Physical Library with Python and Azure

When digitizing a physical library, you have many tools at your disposal. It is important, however, to choose those that can work efficiently and intelligently. Modern digitization techniques go much further than simple OCR (optical character recognition). Using Python to leverage Azure AI services, you can take advantage of semantic analysis to extract, summarise and categorise documents.

Using Azure Cognitive Services

Azure Cognitive Services is a suite of tools covering four key areas of intelligent processing: vision, speech, language and decision-making. It also includes the Azure OpenAI service. These tools give developers access to up-to-date AI techniques without their needing to be experts in the field. The services are accessed through REST APIs and client library SDKs in popular languages like Python. Using them, you can build computer vision, speech and textual analysis into your apps.

For document digitization, the first stage is to read text from images. For this, you can make use of Azure’s vision APIs in combination with its NLP (natural language processing) services. The NLP service offers:

Language detection – to determine the text’s language (English, Italian, German, etc.).
Key phrase extraction – to isolate the main thematic points of the text.
Sentiment analysis – to identify the overall tone (e.g. optimistic, pessimistic, etc.).
Named entity recognition – to pick out references to people, places, dates and other named entities.
Entity linking – to contextualise the extracted entities with reference to online sources.
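As a rough illustration of the REST route, the sketch below prepares a request for one of the language features listed above without sending it. The URL paths follow the Text Analytics v3.1 REST API and are an assumption to check against the version your own resource exposes; the function name, the feature labels and the example endpoint are ours, not DGS's.

```python
import json
import urllib.request

# Paths as used by the Text Analytics v3.1 REST API (assumption: verify
# against the API version your Azure resource actually exposes).
FEATURE_PATHS = {
    "language_detection": "languages",
    "key_phrases": "keyPhrases",
    "sentiment": "sentiment",
    "named_entities": "entities/recognition/general",
    "entity_linking": "entities/linking",
}


def build_language_request(endpoint, key, feature, texts, language=None):
    """Prepare (but do not send) a POST request for one language feature."""
    documents = []
    for i, text in enumerate(texts, start=1):
        doc = {"id": str(i), "text": text}
        if language is not None:  # leave the hint out for language detection
            doc["language"] = language
        documents.append(doc)
    url = f"{endpoint.rstrip('/')}/text/analytics/v3.1/{FEATURE_PATHS[feature]}"
    return urllib.request.Request(
        url,
        data=json.dumps({"documents": documents}).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Ocp-Apim-Subscription-Key": key,  # the resource's access key
        },
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would send it; in practice the Python client SDK wraps the same calls and is usually the more convenient choice.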

With these tools, more in-depth digital reading is possible than with basic OCR. OCR technologies have been around for many years, but have always had their limitations. Without some level of semantic understanding, it can be difficult to recover textual content that is not clearly reproduced and consistently ordered. This was a particular challenge for DGS in digitizing a corpus of art journals dating back to the nineteenth century. Preserving text order was made challenging by document noise, embedded images, a columnar format and other difficulties. However, semantic analysis allowed clearer extraction of coherent texts as well as a much richer set of information for further processing.
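To make the column problem concrete, here is a minimal sketch of a purely geometric heuristic for restoring reading order on a multi-column page. It is not the semantic approach DGS describes, only a simpler stand-in for illustration. The `x`, `y` and `text` fields are hypothetical; real OCR output typically provides bounding polygons from which such values would have to be derived.

```python
def reading_order(lines, page_width, n_columns=2):
    """Return line texts column by column, top to bottom within each column.

    Each line is a dict with hypothetical keys: 'x' (left edge), 'y' (top
    edge) and 'text'. Columns are assumed to be of equal width.
    """
    column_width = page_width / n_columns

    def sort_key(line):
        # Clamp so that a line starting at the far right edge stays in the last column
        column = min(int(line["x"] // column_width), n_columns - 1)
        return (column, line["y"])

    return [line["text"] for line in sorted(lines, key=sort_key)]
```

A heuristic like this breaks down on exactly the cases mentioned above, such as embedded images or headings that span columns, which is where richer analysis earns its keep.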

From Python to SQL Server

By identifying certain regularities in the textual structure, more accurate digital reading of documents is possible. To see how this works, let’s look again at DGS’s journal digitization project. DGS found that features could be extracted programmatically using Python with Azure’s AI data. A very simple example is the positioning of page titles and numbers, which always appear as the first and last lines of a page. This allowed a simple extraction of titles.
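The original extraction script is not reproduced here, so the following is only a minimal sketch of the idea, assuming each page arrives as an ordered list of text lines. The function name and the returned keys are hypothetical.

```python
def split_page(lines):
    """Split one page's ordered text lines into title, body and page number.

    Assumes the title is the first non-empty line and the page number,
    when present, is the last.
    """
    cleaned = [line.strip() for line in lines if line.strip()]
    if len(cleaned) < 2:
        title = cleaned[0] if cleaned else None
        return {"title": title, "body": [], "page_number": None}

    title, *body, last = cleaned
    if last.isdecimal():
        page_number = int(last)
    else:
        # The last line is not a number, so keep it as body text
        body.append(last)
        page_number = None
    return {"title": title, "body": body, "page_number": page_number}
```

For instance, `split_page(["THE ART JOURNAL", "Some text.", "17"])` yields the title `"THE ART JOURNAL"`, one body line and page number `17`.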

The result of these Python scripts was a document dictionary in JSON format, e.g.:

doc_json = {
    'Doc_name': file_name,
    'Doc_link': sas_url,
    'Title': title,
    'FullText': full_text,
    'PageText': page,
    'entities': entities
}

From here, a stored procedure was implemented to convert and store the data in SQL Server.
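The stored procedure itself is not shown, so the sketch below only illustrates one plausible hand-off: serialise the dictionary and pass it as a single parameter. The procedure name `dbo.usp_ImportDocument` is hypothetical, and `cursor` is assumed to be a DB-API cursor, such as one obtained from a pyodbc connection to SQL Server.

```python
import json


def store_document(cursor, file_name, sas_url, title, full_text, page, entities):
    """Serialise one document and hand it to a stored procedure."""
    doc_json = {
        "Doc_name": file_name,
        "Doc_link": sas_url,
        "Title": title,
        "FullText": full_text,
        "PageText": page,
        "entities": entities,
    }
    # ensure_ascii=False keeps accented characters readable in the payload
    payload = json.dumps(doc_json, ensure_ascii=False)
    # ODBC call syntax; the procedure name is a placeholder (assumption)
    cursor.execute("{CALL dbo.usp_ImportDocument (?)}", (payload,))
    return payload
```

On the database side, a procedure receiving the payload could unpack it with SQL Server's `OPENJSON` function before inserting rows into the relevant tables.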

Navigating your new database with Azure Cognitive Search

With all document data extracted and stored in a relational database, powerful and efficient search strategies are possible. For this, Azure Cognitive Search provides effective functionality. It can work with SQL databases, blob storage and other properly structured data sources containing extracted metadata: fields like title, date, category, source and so on. Once configured, the Azure Cognitive Search platform is queried through REST or .NET requests and responses.

Azure Cognitive Search is a cloud-based search service that works with heterogeneous content. This AI-powered flexibility means that it is especially well-suited to digitized document resources where a variety of types and formats are involved. Azure Cognitive Search includes the following capacities:

A search engine that incorporates full-text resources.
Rich indexing, drawing on lexical analysis and AI enrichment.
Rich query searching, including fuzzy search, autocomplete and geo-search.
REST API and client SDK programmable accessibility.
Data-level Azure integration for machine learning and AI.
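As an illustration of the REST access and fuzzy search mentioned above, this sketch prepares a query against a search index without sending it. The service name, index name and function name are placeholders, and the API version shown is one stable release that may not be the one you should target.

```python
import json
import urllib.parse
import urllib.request


def build_search_request(service, index, api_key, text, fuzzy=False, top=10,
                         api_version="2020-06-30"):
    """Prepare (but do not send) a query against a search index."""
    if fuzzy:
        # Full Lucene syntax: a trailing ~ requests fuzzy matching on a term.
        # Terms containing Lucene special characters would need escaping first.
        text = " ".join(f"{term}~" for term in text.split())
    body = {
        "search": text,
        "top": top,
        "queryType": "full" if fuzzy else "simple",
    }
    url = (
        f"https://{service}.search.windows.net/indexes/"
        f"{urllib.parse.quote(index)}/docs/search?api-version={api_version}"
    )
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "api-key": api_key},
        method="POST",
    )
```

With `fuzzy=True`, a misspelt query such as "Turnr landscape" is sent as "Turnr~ landscape~", which lets the service match near spellings in the indexed journal text.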

Best practices for digitalization processes

Let’s finish up with three best practice tips for your own AI-powered digitization projects.

1. Have secure and shared storage

To get the best out of digital resources, accessibility is key. This is one of the primary motivations for using cloud services like Azure, which offer extensive multi-user and location-neutral accessibility. Of course, by embracing shared access, you must not sacrifice security. Azure has built-in controls for multi-level security, meaning you can tightly control who has access to which resources. It also includes protections against malicious users, such as Microsoft Defender for Cloud.

2. Foster a ‘paperless’ corporate culture

Adequate infrastructure and digital technologies are essential conditions for a paperless office. But to work in practice, it needs all staff to stick to digital working practices and avoid accumulating paper resources. Understandably, some are quite used to conventional ways of working and may react badly to change if it is imposed too forcefully. A better approach is to develop a corporate culture where digital resources become the norm. Encourage by setting a good example, so that everyone can see the benefits of digitization. And provide training, to allow employees to use their new tools effectively and with confidence.

3. Make search easy

Easy search should be a big win for digitalization with AI. The difficulty of finding physical papers in a disordered filing system is well-known, but converting to digital resources doesn’t necessarily make things easier. Search technologies have improved significantly since the days when orderly folder management was essential, but some resources can still be hard to find without organization. To get the best out of the paperless office, then, it’s important to leverage effective search technologies. Azure Cognitive Search is a strong fit, as it combines powerful indexing, lexical analysis and AI enrichment in a cloud-based platform.

Conclusions: go digital!

Going fully paperless is long overdue. And now that we have the cloud-based AI tools to make it a reality, there’s no excuse to lag behind. With platforms like Azure, and accessible programming languages like Python, companies can reap the benefits of digitization with ease. Digital indexing, for example, is much more powerful and comprehensive than manual indexing. Digital search of document stores like journals can yield more useful information than manual search. And by going fully digital, you can avoid manual handling errors and lost or misplaced paper documents.
