How to extract text from PDF files and post it to Solr

PDF is a common format for ebooks and other documents, but finding information quickly becomes a problem when you have hundreds of PDF files. This post shows you how to extract the text from your PDFs and send it to Solr so that you can quickly locate the files that contain the information you are looking for.

Go to your Solr installation directory and start the server with the following command:

 
F:\setup\jar\solr-4.2.1\example>java -jar start.jar
 

Posting a single PDF file is easy:

 
 
F:\tmp>curl "http://localhost:8983/solr/update/extract?literal.id=1&commit=true" -F "myfile=@Solr.pdf"
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">0</int><int name="QTime">2683</int></lst>
</response>
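
The commit=true parameter makes the document searchable right away. If you are going to post a lot of files, you can also leave it off and issue a single commit at the end; a minimal sketch:

curl "http://localhost:8983/solr/update?commit=true"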
 
 

The /update/extract endpoint is handled by the ExtractingRequestHandler (Solr Cell); you can find its configuration in \solr-4.2.1\example\solr\collection1\conf\solrconfig.xml:

 
  <!-- Solr Cell Update Request Handler
 
       http://wiki.apache.org/solr/ExtractingRequestHandler 
 
    -->
  <requestHandler name="/update/extract" 
                  startup="lazy"
                  class="solr.extraction.ExtractingRequestHandler" >
    <lst name="defaults">
      <str name="lowernames">true</str>
      <str name="uprefix">ignored_</str>
 
      <!-- capture link hrefs but ignore div attributes -->
      <str name="captureAttr">true</str>
      <str name="fmap.a">links</str>
      <str name="fmap.div">ignored_</str>
    </lst>
  </requestHandler>
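
lowernames=true normalizes the extracted field names to lowercase, and uprefix=ignored_ renames any extracted field that is not defined in the schema so that it starts with ignored_. The stock example schema catches those with a dynamic field that neither indexes nor stores them; the relevant lines look roughly like this (check your own schema.xml):

   <dynamicField name="ignored_*" type="ignored" multiValued="true"/>
   <fieldtype name="ignored" stored="false" indexed="false" multiValued="true" class="solr.StrField" />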
 

The literal.id parameter corresponds to the "id" field defined in F:\setup\jar\solr-4.2.1\example\solr\collection1\conf\schema.xml:

 
 
   <field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" /> 
 
 

If you use the default schema.xml, the "content" field is stored, which means all of the text extracted from the PDF file is kept in the Solr document:

 
<field name="content" type="text_general" indexed="false" stored="true" multiValued="true"/>
 

Storing this field is what makes highlighting possible, but I noticed that queries get slow when the full extracted text is stored in the Solr document. If you don't need highlighting, just turn storage off and queries will be much faster:

 
<field name="content" type="text_general" indexed="false" stored="false" multiValued="true"/>
 

Query a keyword:
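
A keyword search is just a request to the /select handler, for example (the keyword is only an illustration, and it assumes the extracted text is searchable through your schema's catch-all text field):

curl "http://localhost:8983/solr/select?q=packt&wt=xml&indent=true"

The matching document in the response contains the fields Solr Cell extracted: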

 
  <doc>
    <arr name="links">
      <str>http://www.PacktPub.com</str>  
    </arr>
    <str name="id">1</str>
    <arr name="content_type">
      <str>application/pdf</str>
    </arr>
    <long name="_version_">1435453156197662720</long></doc>
 

Surprisingly, the command did not add the file name to the document. You can put the file name into any field you like, for example the title:

 
curl "http://localhost:8983/solr/update/extract?literal.id=1&literal.title=Solr.pdf&commit=true" -F "myfile=@Solr.pdf"
 

If you have many PDF files, it is better to index them with a script. Here is a PHP script that posts every PDF file in the current directory:

 
 
<?php

$cwd = getcwd();

$filepath = $cwd;
$filelist = `ls *.pdf`;
$filelist = explode("\n", $filelist);
// drop the empty element after the trailing newline
unset($filelist[count($filelist) - 1]);

// names of files that were already indexed, one per line
$indexedtext = file_exists('indexed.txt') ? file_get_contents('indexed.txt') : '';
$indexed = explode("\n", $indexedtext);
unset($indexed[count($indexed) - 1]);

foreach ($filelist as $file) {
    if (in_array($file, $indexed)) {
        echo "[DONE]: " . $file . " already indexed, continue\n";
        continue;
    }
    echo "//=======================START index\n";
    echo "File: " . $file . "\n";

    $cmdstr = 'curl "http://localhost:8983/solr/update/extract?literal.id=' . md5($file) . '&literal.filename=' . urlencode($file) . '&literal.filepath=' . urlencode($filepath) . '&commit=true" -F "myfile=@' . $file . '"';
    echo "curl command: " . $cmdstr . "\n";
    $response = `$cmdstr`;
    echo "curl response is " . $response;
    echo "\n";
    if (strstr($response, "<int name=\"status\">0<")) {
        echo "[SUCCESS]: " . $file . " successfully indexed.\n";
        // remember this file so it is skipped on the next run
        $f = fopen('indexed.txt', "w");
        $indexedtext .= $file . "\n";
        fwrite($f, $indexedtext);
        fclose($f);
    } else {
        echo "[FAILED]: " . $file . " failed to index\n";
    }
    echo "//=======================END index\n";
    echo "\n\n\n";
}

echo count($filelist) . " files\n";
exit;
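
Save the script (the file name index_pdfs.php is just an example) and run it from the directory that contains the PDFs. It assumes php, curl and ls are available on your PATH (e.g. via Cygwin or GnuWin32 on Windows):

php index_pdfs.php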
 

The script gets the list of all PDF files, builds a curl command for each one, and executes them one by one. When a file is indexed successfully, its name is written to a text file, so the next time you run the script those files are skipped. If you add new files to the directory, just run the script again and it will pick up only the new ones.

The script also adds the path and the file name to the document, so add these two lines to schema.xml:

 
   <field name="filename" type="text_general" indexed="true" stored="true"/>
   <field name="filepath" type="text_general" indexed="true" stored="true"/>
 
 

failed creating formpost

One of the problems you may come across is "failed creating formpost". This happens when a file name contains a comma, which curl treats as a delimiter.

Here is a PHP script that generates rename commands to strip the commas from the file names:

 
<?php

// y.txt contains the list of file names to check, one per line
$filelist = explode("\n", file_get_contents('y.txt'));
unset($filelist[count($filelist) - 1]);

foreach ($filelist as $file) {
    echo "rename ";
    echo "\"" . str_replace("\r", "", $file) . "\"" . " ";
    $newname = str_replace(",", " ", $file);
    echo "\"" . str_replace("\r", "", $newname) . "\"" . "\n";
}
?>
 

Redirect the output to a bat file and then execute it.
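
For example (the script and output file names are just illustrations):

php rename.php > rename.bat
rename.bat

Each line of the generated file is a Windows rename command with the old, comma-containing name followed by the cleaned-up name.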

Or you can rename any problematic file names at the beginning of the indexing script:

 
 
<?php

$cwd = getcwd();

$filepath = $cwd;
$filelist = `ls *.pdf`;
$filelist = explode("\n", $filelist);
unset($filelist[count($filelist) - 1]);

foreach ($filelist as $key => $file) {
    // rename the file first if it contains characters that break curl's -F option
    $newname = str_replace(",", " ", $file);
    $newname = str_replace(";", " ", $newname);
    if ($newname != $file) {
        echo "[BAD NAME, RENAMING] " . $newname . "\n";
        rename($file, $newname);
    }
    $filelist[$key] = $newname;
}

// names of files that were already indexed, one per line
$indexedtext = file_exists('indexed.txt') ? file_get_contents('indexed.txt') : '';
$indexed = explode("\n", $indexedtext);
unset($indexed[count($indexed) - 1]);

foreach ($filelist as $file) {
    if (in_array($file, $indexed)) {
        echo "[DONE]: " . $file . " already indexed, continue\n";
        continue;
    }
    echo "//=======================START index\n";
    echo "File: " . $file . "\n";

    $cmdstr = 'curl "http://localhost:8983/solr/update/extract?literal.id=' . md5($file) . '&literal.filename=' . urlencode($file) . '&literal.filepath=' . urlencode($filepath) . '&commit=true" -F "myfile=@' . $file . '"';
    echo "curl command: " . $cmdstr . "\n";
    $response = `$cmdstr`;
    echo "curl response is " . $response;
    echo "\n";
    if (strstr($response, "<int name=\"status\">0<")) {
        echo "[SUCCESS]: " . $file . " successfully indexed.\n";
        // remember this file so it is skipped on the next run
        $f = fopen('indexed.txt', "w");
        $indexedtext .= $file . "\n";
        fwrite($f, $indexedtext);
        fclose($f);
    } else {
        echo "[FAILED]: " . $file . " failed to index\n";
    }
    echo "//=======================END index\n";
    echo "\n\n\n";
}

echo count($filelist) . " files\n";
exit;