Capturing Data from Non-Standard SSL Sites

The Apache Tika parser integrated into Fusion can handle most crawls. In some cases, however, the security checks around the parser cause problems, e.g. for sites with self-signed, expired, or otherwise invalid certificates. This can make an SSL crawl difficult at best. To get around this, you can create a custom JavaScript stage that acquires the content without regard to the validity of the certificate.

NOTE: You should only use this when you're crawling trusted sites.  
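One way to enforce the note above is to check each document's URL against a list of hosts you trust before skipping certificate validation. The following is a minimal sketch in plain JavaScript; the names TRUSTED_HOSTS, hostOf, and isTrustedHost, and the example host names, are hypothetical and not part of Fusion.

```javascript
// Hypothetical allowlist; replace with the hosts you actually trust.
var TRUSTED_HOSTS = ["intranet.example.com", "wiki.example.com"];

// Return the lower-cased host name of an http(s) URL, or null if it does not parse.
function hostOf(url) {
    // Skip optional userinfo ("user:pw@"), then capture the host up to a port or path.
    var match = /^https?:\/\/(?:[^@\/?#]*@)?(\[[^\]]+\]|[^:\/?#]+)/i.exec("" + url);
    return match ? match[1].toLowerCase() : null;
}

// True only when the URL's host is an exact match for an allowlist entry.
function isTrustedHost(url, trustedHosts) {
    var host = hostOf(url);
    return host !== null && trustedHosts.indexOf(host) !== -1;
}
```

Inside the stage, a guard such as "if (!isTrustedHost(doc.getId(), TRUSTED_HOSTS)) { return doc; }" placed before any SSL setup would leave documents from other hosts untouched.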

With that said, below is sample code for acquiring the content directly by calling Java classes from the JavaScript stage:

function (doc) {
    var BufferedReader = java.io.BufferedReader;
    var InputStreamReader = java.io.InputStreamReader;
    var URL = java.net.URL;
    var String = java.lang.String;
    var X509Certificate = java.security.cert.X509Certificate;
    var HostnameVerifier = javax.net.ssl.HostnameVerifier;
    var HttpsURLConnection = javax.net.ssl.HttpsURLConnection;
    var SSLContext = javax.net.ssl.SSLContext;
    var SSLSession = javax.net.ssl.SSLSession;
    var TrustManager = javax.net.ssl.TrustManager;
    var X509TrustManager = javax.net.ssl.X509TrustManager;

    var stdout = "";
    // Java.type needs the fully qualified class name, including for array types
    var TrustManagerArray = Java.type("javax.net.ssl.TrustManager[]");

    try {
        // Java.extend returns a class, so it must be instantiated below.
        // This TrustManager accepts every certificate chain without checking it.
        var TrustAllManager = Java.extend(X509TrustManager, {
            getAcceptedIssuers: function() {
                return null;
            },
            checkClientTrusted: function(certs, authType) {
            },
            checkServerTrusted: function(certs, authType) {
            }
        });
        var trustAllCerts = new TrustManagerArray(1);
        trustAllCerts[0] = new TrustAllManager();

        // Install the all-trusting trust manager
        var sc = SSLContext.getInstance("SSL");
        sc.init(null, trustAllCerts, new java.security.SecureRandom());
        HttpsURLConnection.setDefaultSSLSocketFactory(sc.getSocketFactory());
 
        // Create all-trusting host name verifier
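        // Java.extend returns a class; instantiate it to get the verifier
        var AllHostsValid = Java.extend(HostnameVerifier, {
            verify: function(hostname, session) {
                return true;
            }
        });
        var allHostsValid = new AllHostsValid();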
 
        // Install the all-trusting host verifier
        HttpsURLConnection.setDefaultHostnameVerifier(allHostsValid);
        var oracle = new URL(doc.getId());
        var isr = new InputStreamReader(oracle.openStream());
        var ins = new BufferedReader(isr);

        var inputLine = "";
        while ((inputLine = ins.readLine()) !== null) {
            logger.info(inputLine);
            stdout += inputLine;
        }
        ins.close();
        doc.addField("body", stdout);
        doc.addField("_raw_content_", stdout);
    } catch (e) {
        logger.error(e);
    }
    return doc;
}

In short, the stage does the following:

  1. Declare the Java classes we will be using.
  2. Declare our local variables.
  3. Create an all-trusting TrustManager.
  4. Install the TrustManager and an all-trusting HostnameVerifier as the defaults. Note: this affects ANY HTTPS request made under this instance of the JVM.
  5. Instantiate a new URL (when using the Web datasource in Fusion, the URL is the ID of the document).
  6. Instantiate a new InputStreamReader and feed that into a new BufferedReader.
  7. Iterate over the stream and concatenate the lines into the final 'stdout' variable.
  8. Close the stream.
  9. Add our content to our PipelineDocument. In this case we set both 'body' and '_raw_content_', just as the Tika Parser would do.
  10. Return the PipelineDocument.
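
Because installing the TrustManager as a default affects every HTTPS request in the JVM, it may be preferable to attach the all-trusting settings to a single connection instead. The following is a rough sketch of that variation, not the method used above. It assumes the stage runs under Nashorn (which provides Java.type and Java.extend) and that the URL uses https; fetchWithScopedTrust is a hypothetical helper name, and the UTF-8 charset is an assumption about the page encoding.

```javascript
// Hypothetical helper: fetch one HTTPS page without certificate validation,
// leaving the JVM-wide SSL defaults untouched. Requires Nashorn.
function fetchWithScopedTrust(urlString) {
    var URL = Java.type("java.net.URL");
    var BufferedReader = Java.type("java.io.BufferedReader");
    var InputStreamReader = Java.type("java.io.InputStreamReader");
    var SecureRandom = Java.type("java.security.SecureRandom");
    var SSLContext = Java.type("javax.net.ssl.SSLContext");
    var X509TrustManager = Java.type("javax.net.ssl.X509TrustManager");
    var HostnameVerifier = Java.type("javax.net.ssl.HostnameVerifier");
    var TrustManagerArray = Java.type("javax.net.ssl.TrustManager[]");
    var X509CertificateArray = Java.type("java.security.cert.X509Certificate[]");

    // Trust manager that accepts any certificate chain
    var TrustAll = Java.extend(X509TrustManager, {
        getAcceptedIssuers: function () { return new X509CertificateArray(0); },
        checkClientTrusted: function (certs, authType) {},
        checkServerTrusted: function (certs, authType) {}
    });
    // Hostname verifier that accepts any host name
    var AnyHost = Java.extend(HostnameVerifier, {
        verify: function (hostname, session) { return true; }
    });

    var managers = new TrustManagerArray(1);
    managers[0] = new TrustAll();
    var sc = SSLContext.getInstance("TLS");
    sc.init(null, managers, new SecureRandom());

    // Assumes an https URL, so the connection is an HttpsURLConnection
    var conn = new URL(urlString).openConnection();
    // Only this connection skips validation
    conn.setSSLSocketFactory(sc.getSocketFactory());
    conn.setHostnameVerifier(new AnyHost());

    var reader = new BufferedReader(new InputStreamReader(conn.getInputStream(), "UTF-8"));
    var content = "";
    try {
        var line;
        while ((line = reader.readLine()) !== null) {
            content += line + "\n";
        }
    } finally {
        reader.close();
    }
    return content;
}
```

In the stage, the result of fetchWithScopedTrust(doc.getId()) could then be added to the 'body' and '_raw_content_' fields in the same way as in steps 9 and 10.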