Tuesday, February 14, 2006

MSXML vb.net: Using MSHTML and IPersistStreamInit to load documents from memory

The code below loads an HTML string into an MSHTML document, giving you a DOM tree that you can easily manipulate from VB.net or C#. Once the HTML is loaded, you can use getElementById and the other DOM functions to work with the document. This makes it a convenient way to parse an HTML document in VB.net.

Private Function LoadHTML(ByVal value As String) As MSHTML.HTMLDocument
    Dim clsDocument As New MSHTML.HTMLDocument
    clsDocument.createDocumentFromUrl("about:blank", vbNullString)
    ' Put the document into a freshly initialized state before loading.
    DirectCast(clsDocument, IPersistStreamInit).InitNew()
    ' Copy the string into unmanaged memory and wrap that memory in a COM stream.
    Dim ptrValue As IntPtr = System.Runtime.InteropServices.Marshal.StringToHGlobalAuto(value)
    Dim clsStream As System.Runtime.InteropServices.ComTypes.IStream = Nothing
    ' fDeleteOnRelease = True: the memory is freed when the stream is released.
    CreateStreamOnHGlobal(ptrValue, True, clsStream)
    ' Load the content into the document.
    DirectCast(clsDocument, IPersistStreamInit).Load(clsStream)
    Return clsDocument
End Function
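
As a rough sketch of how the function might be called: the snippet below loads a small HTML string and reads one element by id. The method name ShowGreeting, the sample markup, and the five-second timeout are made up for illustration. It also assumes the project references System.Windows.Forms, because as far as I know MSHTML parses the stream asynchronously and needs messages pumped until readyState reports "complete".

```vbnet
' Hypothetical caller; place it in the same class as LoadHTML.
' Assumes a reference to System.Windows.Forms for Application.DoEvents.
Private Sub ShowGreeting()
    Dim html As String = "<html><body><p id=""greeting"">Hello</p></body></html>"
    Dim doc As MSHTML.HTMLDocument = LoadHTML(html)

    ' Pump messages until the document finishes parsing, or give up
    ' after an arbitrary five seconds so the loop cannot spin forever.
    Dim deadline As Date = Date.Now.AddSeconds(5)
    Do While doc.readyState <> "complete" AndAlso Date.Now < deadline
        System.Windows.Forms.Application.DoEvents()
    Loop

    ' getElementById returns Nothing when no element has that id.
    Dim element As MSHTML.IHTMLElement = doc.getElementById("greeting")
    If element IsNot Nothing Then
        Console.WriteLine(element.innerText)
    End If
End Sub
```

If the element comes back as Nothing, the first thing to check is whether the document had actually reached the "complete" state before the lookup.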

You also need to add the following declarations to the class in which you use the LoadHTML function. The Imports line goes at the top of the file.

Imports System.Runtime.InteropServices
Public Enum HRESULT
    S_OK = 0
    E_NOTIMPL = &H80004001
    E_INVALIDARG = &H80070057
    E_NOINTERFACE = &H80004002
    E_FAIL = &H80004005
End Enum

<ComVisible(True), ComImport(), Guid("7FD52380-4E07-101B-AE2D-08002B2EC713"), _
InterfaceTypeAttribute(ComInterfaceType.InterfaceIsIUnknown)> _
Public Interface IPersistStreamInit : Inherits IPersist
    Shadows Sub GetClassID(ByRef pClassID As Guid)
    <PreserveSig()> Function IsDirty() As Integer
    <PreserveSig()> Function Load(ByVal pstm As ComTypes.IStream) As HRESULT
    <PreserveSig()> Function Save(ByVal pstm As ComTypes.IStream, _
        <MarshalAs(UnmanagedType.Bool)> ByVal fClearDirty As Boolean) As HRESULT
    <PreserveSig()> Function GetSizeMax(<InAttribute(), Out(), _
        MarshalAs(UnmanagedType.U8)> ByRef pcbSize As Long) As HRESULT
    <PreserveSig()> Function InitNew() As HRESULT
End Interface

<ComVisible(True), ComImport(), Guid("0000010c-0000-0000-C000-000000000046"), _
InterfaceTypeAttribute(ComInterfaceType.InterfaceIsIUnknown)> _
Public Interface IPersist
    Sub GetClassID(ByRef pClassID As Guid)
End Interface

Declare Function CreateStreamOnHGlobal Lib "ole32" (ByVal hGlobal As IntPtr, ByVal fDeleteOnRelease As Boolean, ByRef ppstm As ComTypes.IStream) As Integer
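
InitNew and Load return an HRESULT, but LoadHTML ignores it, so a failed load goes unnoticed. One possible way to surface failures is a small helper like the one below. The name ThrowIfFailed is made up for this sketch; Marshal.ThrowExceptionForHR is the standard .NET call that converts a failing HRESULT into an exception.

```vbnet
' Hypothetical helper: turns a failing HRESULT into a .NET exception.
Private Shared Sub ThrowIfFailed(ByVal hr As HRESULT)
    ' Failure codes have the high bit set, so they are negative as Integers.
    If CInt(hr) < 0 Then
        System.Runtime.InteropServices.Marshal.ThrowExceptionForHR(CInt(hr))
    End If
End Sub

' Possible use inside LoadHTML:
' ThrowIfFailed(DirectCast(clsDocument, IPersistStreamInit).InitNew())
' ThrowIfFailed(DirectCast(clsDocument, IPersistStreamInit).Load(clsStream))
```

Success codes such as S_OK pass through untouched, so the wrapped calls behave as before when nothing goes wrong.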

I got the basic idea from Balaji's blog, which he in turn got from sp!ke. The code has been slightly modified to work in Visual Studio 2005.



Anonymous said...


I was checking out your code, and I was wondering if value in LoadHTML is the URL, or the HTML as a string. Some of your code is a bit over my head, so it is hard for me to understand.

Thank you.

Vivek Jishtu said...

The value parameter is the HTML itself, passed as a string, not a URL.