How To Read DOC file Using Java and Apache POI

One of the visitors to my blog asked me to write about how to read a Word document file using Java. I wrote the following program to demonstrate how Apache POI can be used for this purpose.

I used the following library to write this program. If you have downloaded Apache POI, you should find this jar file within the bundle. The core POI jar from the same bundle must also be on the classpath, because the program uses the POIFS and HPSF classes as well.

poi-scratchpad-3.2-FINAL-20081019.jar

The tutorial demonstrates the following features:

–How to read a simple Microsoft Word document file using Java and Apache POI (.docx is not supported)
–This includes reading the total number of paragraphs and the content of each paragraph
–How to read the document headers
–How to read the document footers
–How to read the document summary
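
Since .docx is not supported by this program, it can help to check what kind of file you have before passing it to POI. The sketch below is my own illustration and not part of POI: the class WordFormatSniffer and its methods are hypothetical names. It relies on two well-known signatures: legacy .doc files are OLE2 compound files, which start with the bytes D0 CF 11 E0 A1 B1 1A E1, while .docx files are ZIP archives, which start with 50 4B 03 04. Keep in mind that this only identifies the container. An OLE2 file could also be an .xls or .ppt, and a ZIP file is not necessarily a .docx.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

/** Hypothetical helper (not part of Apache POI): guesses a file's container format. */
final class WordFormatSniffer {

	/** First eight bytes of an OLE2 compound file, the container used by legacy .doc files. */
	private static final byte[] OLE2_MAGIC = {
		(byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
		(byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
	};

	/** Local file header signature of a ZIP archive, the container used by .docx files. */
	private static final byte[] ZIP_MAGIC = { 0x50, 0x4B, 0x03, 0x04 };

	/** Returns "OLE2", "ZIP" or "UNKNOWN" based on the leading bytes of a file. */
	static String detect(byte[] head) {
		if (startsWith(head, OLE2_MAGIC)) return "OLE2";
		if (startsWith(head, ZIP_MAGIC)) return "ZIP";
		return "UNKNOWN";
	}

	private static boolean startsWith(byte[] data, byte[] prefix) {
		if (data.length < prefix.length) return false;
		return Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
	}

	/** Reads at most the first eight bytes of the file and classifies them. */
	static String detectFile(String fileName) throws IOException {
		try (InputStream in = Files.newInputStream(Paths.get(fileName))) {
			byte[] head = new byte[OLE2_MAGIC.length];
			int total = 0;
			// A single read() may return fewer bytes than requested, so loop until full or EOF
			while (total < head.length) {
				int n = in.read(head, total, head.length - total);
				if (n < 0) break;
				total += n;
			}
			return detect(Arrays.copyOf(head, total));
		}
	}

	public static void main(String[] args) throws IOException {
		if (args.length == 0) {
			System.out.println("Usage: java WordFormatSniffer <file>");
			return;
		}
		System.out.println(args[0] + " -> " + detectFile(args[0]));
	}
}
```

If the result is "OLE2", the file is a candidate for the HWPF classes used in this tutorial. If it is "ZIP", it is most likely one of the newer formats that this program cannot read.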

As of version 3.2, Apache POI is not robust yet. It still has a long way to go before it can handle complex document formats. I also found that classes move from one package to another between versions. So if you are using an older or newer version of POI and get compilation errors on the imports, try looking for the classes in other packages.
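
One way to find out where a class lives in the POI version you have is to probe the classpath at runtime. The following is only a rough sketch: ClassLocator and findFirstAvailable are hypothetical names of my own, and the candidate list is whatever you choose to pass in. It uses the standard Class.forName call and nothing from POI itself.

```java
/** Hypothetical helper: reports which of several candidate class names is on the classpath. */
final class ClassLocator {

	/** Returns the first candidate that can be loaded, or null if none can. */
	static String findFirstAvailable(String... candidates) {
		for (String name : candidates) {
			try {
				// initialize=false: only check that the class exists, do not run its static initializers
				Class.forName(name, false, ClassLocator.class.getClassLoader());
				return name;
			} catch (ClassNotFoundException | LinkageError e) {
				// Not present (or not loadable) on this classpath; try the next candidate
			}
		}
		return null;
	}

	public static void main(String[] args) {
		// Pass fully qualified candidate names as arguments; by default, probe the
		// package that this tutorial imports HeaderStories from.
		String[] candidates = args.length > 0
				? args
				: new String[] { "org.apache.poi.hwpf.usermodel.HeaderStories" };
		String found = findFirstAvailable(candidates);
		System.out.println(found == null ? "None of the candidates were found" : "Found: " + found);
	}
}
```

Run it with the POI jars on the classpath and pass the package names you want to try. It only tells you whether a class with that exact name exists, so you still have to guess the candidate packages, for example by browsing the jar contents.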

You can download the sample document that I read with the following program. You can also download the source code for this application. You are free to use and distribute the code. It comes with no warranty at all. I will be honored if you link back to my blog as the source.

/**
 * @author Kushal Paudyal
 * www.sanjaal.com/java
 * Last Modified On: 03/23/2009
 */
package com.kushal.utils;

import org.apache.poi.poifs.filesystem.POIFSFileSystem;
import org.apache.poi.hpsf.DocumentSummaryInformation;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.extractor.WordExtractor;
import org.apache.poi.hwpf.usermodel.HeaderStories;

import java.io.FileInputStream;

public class ReadDocFileFromJava {

	public static void main(String[] args) {
		// This is the document that you want to read using Java.
		String fileName = "C:\\Documents and Settings\\kushalp\\Desktop\\Test.doc";

		// Method call to read the document (demonstrates some usage of POI)
		readMyDocument(fileName);

	}
	public static void readMyDocument(String fileName){
		// try-with-resources (Java 7+) closes the input stream even if reading fails
		try (FileInputStream in = new FileInputStream(fileName)) {
			POIFSFileSystem fs = new POIFSFileSystem(in);
			HWPFDocument doc = new HWPFDocument(fs);

			// Read the content
			readParagraphs(doc);

			int pageNumber = 1;

			// Try reading the header for page 1
			readHeader(doc, pageNumber);

			// Try reading the footer for page 1
			readFooter(doc, pageNumber);

			// Read the document summary
			readDocumentSummary(doc);

		} catch (Exception e) {
			e.printStackTrace();
		}
	}

	public static void readParagraphs(HWPFDocument doc) throws Exception{
		WordExtractor we = new WordExtractor(doc);

		// Get all paragraphs; the array length is the total number of paragraphs
		String[] paragraphs = we.getParagraphText();
		System.out.println("Total Paragraphs: "+paragraphs.length);

		for (int i = 0; i < paragraphs.length; i++) {

			System.out.println("Length of paragraph "+(i +1)+": "+ paragraphs[i].length());
			System.out.println(paragraphs[i]);

		}

	}

	public static void readHeader(HWPFDocument doc, int pageNumber){
		HeaderStories headerStore = new HeaderStories( doc);
		String header = headerStore.getHeader(pageNumber);
		System.out.println("Header Is: "+header);

	}

	public static void readFooter(HWPFDocument doc, int pageNumber){
		HeaderStories headerStore = new HeaderStories( doc);
		String footer = headerStore.getFooter(pageNumber);
		System.out.println("Footer Is: "+footer);

	}

	public static void readDocumentSummary(HWPFDocument doc) {
		DocumentSummaryInformation summaryInfo=doc.getDocumentSummaryInformation();
		// The summary stream is optional in a document, so guard against a missing one
		if (summaryInfo == null) {
			System.out.println("No document summary information found.");
			return;
		}
		String category = summaryInfo.getCategory();
		String company = summaryInfo.getCompany();
		int lineCount=summaryInfo.getLineCount();
		int sectionCount=summaryInfo.getSectionCount();
		int slideCount=summaryInfo.getSlideCount();

		System.out.println("---------------------------");
		System.out.println("Category: "+category);
		System.out.println("Company: "+company);
		System.out.println("Line Count: "+lineCount);
		System.out.println("Section Count: "+sectionCount);
		System.out.println("Slide Count: "+slideCount);

	}

}

Originally posted 2009-03-23 19:43:54.
