Processador genérico de documentos estruturados (XML)

Braga, 10 de Abril de 2000

Alexandra Pinto

E-mail: A19670@correio.ci.uminho.pt

André Sousa

Resumo:

Nesta primeira fase deste trabalho, e no âmbito desta disciplina, o objectivo é o processamento de documentos estruturados e o armazenamento da informação, neles contida, num grove .

Índice

Introdução
Análise
Desenvolvimento
Conclusão
Anexos

Introdução

Este trabalho encontra-se dividido em três fases, sendo este relatório referente à primeira fase. A primeira fase, por sua vez, encontra-se subdividida em três etapas.

Na primeira etapa, pretendia-se usar as ferramentas lex e yacc/bison para construir um analisador léxico e um analisador sintáctico, para a gramática disponibilizada no protocolo.

A segunda etapa tinha como finalidade a construção de um analisador semântico, onde se verifica a abertura e fecho de tags pela seguinte ordem: a última tag aberta é a primeira a ser fechada.

Na terceira etapa, conclui-se o objectivo da primeira fase, sendo este, como já referido, o armazenamento da informação, do documento estruturado, numa estrutura ( grove ).

Análise

Analisador Léxico/Sintáctico (Primeira Etapa)

Depois de analisar a gramática fornecida, foram realizadas algumas pequenas alterações a esta, para um melhor reconhecimento dos documentos estruturados.

Analisador Semântico (Segunda Etapa)

Para atingir o fim desta etapa, era necessário realizar a verificação de abertura e fecho de tags . Assim, a última tag aberta é a primeira a ser fechada.

Para contornar este problema, foi utilizada uma stack . Com este tipo de estrutura a resolução do problema foi facilitada.

Armazenamento da informação (Terceira Etapa)

Para armazenar a informação contida nos documentos estrurados, foi utilizada a estrutura ( grove ) fornecida pelo docente.

Analisada cuidadosamente essa estrutura, foram efectuadas umas pequenas alterações (ver ANEXO A).

Desenvolvimento

Analisador Léxico/Sintáctico (Primeira Etapa)

O desenvolvimento do analisador léxico e analisador sintáctico foi realizado em flex e yacc , respectivamente.

Esta primeira etapa, não apresentou qualquer problema no seu desenvolvimento.

Para consultar o código implementado ao longo desta etapa, consultar ANEXO B.

Analisador Semântico (Segunda Etapa)

Nesta etapa, o problema que se colocava era a verificação da abertura e fecho de tags . Para tal,foi implementada uma stack do tipo LIFO.

Assim, à medida que vão sendo abertas tags , os seus identificadores vão sendo armazenados na stack . Da mesma forma, à medida que as tags vão sendo fechadas, vão sendo removidos os seus identificadores da stack .

Para ver a implementação realizada na segunda etapa, consultar ANEXO C.

Armazenamento da informação (Terceira Etapa)

Como ponto de partida para esta etapa, foi utilizada a estrutura fornecida pelo docente. Esta foi melhorada (ver ANEXO A), sendo inserido um novo campo na estrutura (campo type ). Esta alteração teve como objectivo a obtenção de um algoritmo mais eficiente e rápido.

Nesta etapa, foram realizadas diversas alterações no código já implementado.

No analisador léxico, foram implementadas start conditions , para reconhecer valores, atributos, identificadores e texto.

No analisador sintáctico, foi removido o uso da stack para verificação da abertura/fecho de tags , sendo esta reutilizada com algumas alterações para facilitar a inserção da informação no grove .

Foram implementados novos procedimentos para concatenação de strings e atributos, também para facilitar/simplificar a inserção no grove .

Todos estes procedimentos foram implementados de forma dinâmica, isto é foi realizada alocação e libertação de memória mediante o uso das estruturas.

Tal como era pedido no protocolo da terceira etapa, foi implementada a travessia do grove . O código implementado nesta etapa encontra-se em anexo (ver ANEXO D).

Conclusão

Esta primeira fase do trabalho foi bastante aliciante. Permitiu-nos consolidar os conhecimentos adquiridos nas ferramentas utilizadas e na linguagem C.

Foram surgindo alguns problemas ao longo do seu desenvolvimento, os quais foram ultrapassados da forma mais rápida e eficiente que foi encontrada.

Anexos

Anexo A - Estrutura Grove

Grove disponibilizado


typedef struct strGrove{
  char *id;
  union X{
    char* text;
    struct element{
      struct strGrove* content;
      attrib* attr;
    }elem;
  }cont;
  struct strGrove* brother;
}Gnode;

Grove implementado


typedef struct strGrove{
  char type;
  union X{
    char* text;
    struct element{
      char* id;
      struct strGrove* content;
      attrib* attr;
    }elem;
  }cont;
  struct strGrove* brother;
}Gnode;

Anexo B - Código da Primeira Etapa

Analisador Léxico


%{
#include "y.tab.h"
%}

CHAR	[a-zA-Z-]
VALOR	[0-9]+
ID 	{CHAR}+
%%

{CHAR}	{printf("CHAR\n"); return(CHAR);}

{VALOR}	{printf("VALOR\n"); return(VALOR);}

{ID}	{printf("ID\n"); return(ID);}

\<	{printf("<\n"); return('<');}

\>	{printf(">\n"); return('>');}

\/	{printf("/\n"); return('/');}

\&	{printf("&\n"); return('&');}

\;	{printf(";\n"); return(';');}

\=	{printf("=\n"); return('=');}

[ \t\n]+ ;

<<EOF>> {printf("EOF\n"); return(0);}
%%

Analisador Sintáctico


%token ID VALOR CHAR

%start Documento
%%

Documento:  ElemList ;

ElemList:   CHAR
	  | '&' ID ';'
          | '<' ID AttrList '>' ElemList '<' '/' ID '>'
	  | '<' ID AttrList '/' '>' ;

AttrList:   Attr AttrList
	  | ;

Attr:       ID '=' VALOR ;
%%

#include "lex.yy.c"
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h> 
#include <fcntl.h>


void yyerror(char *s){
  printf("-->%s\n",s);
}

int main(int argc, char **argv){
  int fd;

  if(argc<2){
    printf("Use: 1fase input_file\n");
    return(-1);
  }
  if((fd=open(argv[1],O_RDONLY))==-1){
    perror("Input File");
    return(-1);
  }
  if(dup2(fd,STDIN_FILENO)==-1){
    perror("Dup");
    close(fd);
    return(-1);
  }
  close(fd);
  yyparse();
  return(0);
}

Anexo C - Código da Segunda Etapa

Analisador Léxico


%{
#include <string.h>
int lineno=1;
%}

%s OPENTAG ATTRLST

CHAR	[a-zA-Z-]
VALOR	\"{CHAR}[a-zA-Z0-9]*\"
ID 	{CHAR}+
%%

<OPENTAG>{ID}    { printf("ID\n");
                   yylval.s=strdup(yytext);
                   BEGIN ATTRLST;
                   return(ID); }

<ATTRLST>\>      { printf(">\n");
                   BEGIN 0;
                   return('>'); }

<ATTRLST>{ID}    { printf("ATTR\n");
                   yylval.s=strdup(yytext);
                   return(ATTR); }

<ATTRLST>\=      { printf("=\n");
                   return('='); }

<ATTRLST>{VALOR} { printf("VALOR\n");
                   return(VALOR); }

<OPENTAG>\/      { printf("/\n");
                   return('/'); }

{ID}	         { printf("TEXTO\n");
                   yylval.s=strdup(yytext);
                   return(TEXTO); }

\<	         { printf("<\n");
                   BEGIN OPENTAG;
                   return('<'); }

\&	         { printf("&\n");
                   return('&'); }

\;	         { printf(";\n");
                   return(';'); }

[ \t]+           ;

\n	         lineno++;

<<EOF>>          { printf("EOF\n");
                   return(0); }
%%

Analisador Sintáctico


%token ID VALOR TEXTO ATTR
%union{ char* s; }
%type <s> ID
%start Documento
%%

Documento: '<' ID '>' ElemList '<' '/' ID '>' {if (strcmp($2,"DOC")!=0)
                                                yyerror("Wrong initial tag");
                                               if (strcmp($7,"DOC")!=0)
                                                yyerror("Wrong end tag");};

ElemList:   ElemList Elem
	  | Elem;

Elem:       TEXTO
	  | '&' ID ';'
          | '<' ID AttrList '>' ElemList '<' '/' ID '>' {push($2); pop($8);}
	  | '<' ID AttrList '/' '>' ;

AttrList:   AttrList Attr
	  | ;

Attr:       ATTR '=' VALOR ;

%%

#include "lex.yy.c"
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h> 
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct stack{
  char* tag;
  struct stack* next;
}stk;
	
stk* st;

void yyerror(char *str){
  stk* tmp;

  printf("%s [line %d]\n",str,lineno);
  if(!isEmpty()){
    while(st->next){
      tmp=st;
      free(tmp->tag);
      free(tmp);
      st=st->next;
    }
    free(st);
  }
  exit(-1);
}

int push(char* str){
  stk* tmp;

  if(!st){
    if(st = (stk*)malloc(sizeof(stk))){
      st->tag=NULL;
      st->next=NULL;
    }
    else yyerror("PUSH: Malloc Error");
  }
  if(!str) yyerror("PUSH: String NULL");
  if(!st->tag) st->tag=strdup(str);
  else{
    if(tmp = (stk*)malloc(sizeof(stk))){
      tmp->tag=strdup(str);
      tmp->next=st;
      st=tmp;	
    }
    else yyerror("PUSH: Malloc Error");
  }
  return (0);
}

int pop(char* str){
  stk* tmp;

  if(!str) yyerror("POP: String NULL");
  if(isEmpty()) yyerror("POP: Stack is Empty");
  if(strcmp(str,st->tag)==0){
    tmp=st;
    st=st->next;
    free(tmp->tag);
    free(tmp);
    return(0);	
  }
  else yyerror("POP: Wrong end tag");
}

int isEmpty(){
  if(st) return(0);
  else return(-1);
}	

int main(int argc, char **argv){
  int fd;
  stk* tmp;

  if(argc<2){
    printf("Use: etapa2 input_file\n");
    return(-1);
  }
  if((fd=open(argv[1],O_RDONLY))==-1){
    perror("Input File");
    return(-1);
  }
  if(dup2(fd,STDIN_FILENO)==-1){
    perror("Dup");
    close(fd);
    return(-1);
  }
  close(fd);
  yyparse();
  
  if(!isEmpty()) yyerror("MAIN: The Stack isn't Empty");

  return(0);
}

Anexo D - Código da Terceira Etapa

Analisador Léxico


%{
#include <string.h>
int lineno=1;
int doc=1;
%}

%s OPENTAG ATTRLST VALUE

CHAR	[a-zA-Z-]
VALOR	{CHAR}[a-zA-Z0-9]*
ID 	{CHAR}+
TEXTO	[a-zA-Z0-9.\_\-\!\#\$\%\(\)\{\}\[\]\,\:]+
%%

<OPENTAG>[ \t\n]+ { if(yytext[0]=='\n') lineno++; }

<OPENTAG>{ID}     { yylval.s=strdup(yytext);
                    BEGIN ATTRLST;
                    return(ID); }

<OPENTAG>\/         return('/');

<ATTRLST>[ \t\n]+   if(yytext[0]=='\n') lineno++;

<ATTRLST>\>       { BEGIN 0;
                    return('>'); }

<ATTRLST>{ID}     { yylval.s=strdup(yytext);
                    return(ATTR); }

<ATTRLST>\=         return('=');

<ATTRLST>\"         BEGIN VALUE;

<ATTRLST>\/         return('/');

<VALUE>{VALOR}    { yylval.s=strdup(yytext);
                    return(VALOR); }

<VALUE>\"           BEGIN ATTRLST;

{TEXTO}	          { yylval.s=strdup(yytext);
                    return(TEXTO); }

[ ]+		  { if(doc){
                      yylval.s=strdup(yytext);
                      return(TEXTO); }
                    else return(0); }

\t+		  { if(doc){
                      yylval.s=strdup(yytext);
                      return(TEXTO); }
                    else return(0); }

\n		  { if(doc){
                      yylval.s=strdup(yytext);
                      lineno++;
                      return(TEXTO); }
                    else return(0); }

\<	          { BEGIN OPENTAG;
                    return('<'); }

\&	            return('&');

\;	            return(';');

<<EOF>>             return(0);

Analisador Sintáctico


%token ID VALOR TEXTO ATTR
%union{ char* s; }
%type <s> ID TEXTO VALOR ATTR
%start Documento
%%

Documento: '<' ID  '>' {if (strcmp($2,"DOC")!=0) {yyerror("Wrong initial tag");} openTag($2,NULL);} ElemList '<' '/' ID {if (strcmp($8,"DOC")!=0) yyerror("Wrong end tag");} '>' {doc=0; insertText(resetText()); closeTag();};

ElemList:   ElemList Elem
	  | Elem;

Elem:       TEXTO { catText($1); }
          | '&' TEXTO ';' { catText("&"); catText($2); catText(";"); }
          | '<' ID AttrList '>' {insertText(resetText()); openTag($2,resetAttr());} ElemList '<' '/' ID '>' {if(strcmp($2,$9)!=0) yyerror("Wrong end tag"); insertText(resetText()); closeTag(); }
	  | '<' ID AttrList '/' '>' { insertText(resetText()); openTag($2,resetAttr()); closeTag(); };

AttrList:   AttrList Attr
	  | ;

Attr:       ATTR '=' VALOR { catAttr($1,$3); } ;

%%

#include "lex.yy.c"
#include "grove.h"
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h> 
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

char *text=NULL;
attrib *attr=NULL;

void yyerror(char *str){
  printf("%s [line %d]\n",str,lineno);
  exit(-1);
}

char* resetText(){
  char* tmp;

  tmp=text;
  text=NULL;
  return(tmp);
}

void catText(char *str){
  if(!text) text=str;
  else{
   text=realloc(text,strlen(text)+strlen(str)+1);
   strcat(text,str);
  }
}

attrib *resetAttr(){
  attrib *tmp;

  tmp=attr;
  attr=NULL;
  return(tmp);
}

void catAttr(char *id, char* atr){
  attrib *tmp;

  if(!attr){
    if((attr=(attrib *)malloc(sizeof(attrib)))==NULL)
      yyerror("catAttr: Malloc Error");
    attr->name=id;
    attr->value=atr;
    attr->next=NULL;
  }
  else{
    tmp=attr;
    while(tmp->next) tmp->next;
    if((tmp->next=(attrib *)malloc(sizeof(attrib)))==NULL)
      yyerror("catAttr: Malloc Error");
    tmp->next->name=id;
    tmp->next->value=atr;
    tmp->next->next=NULL;
  }
}

int main(int argc, char **argv){
  int fd;

  if(argc<2){
    printf("Use: etapa2 input_file\n");
    return(-1);
  }
  if((fd=open(argv[1],O_RDONLY))==-1){
    perror("Input File");
    return(-1);
  }
  if(dup2(fd,STDIN_FILENO)==-1){
    perror("Dup");
    close(fd);
    return(-1);
  }
  close(fd);
  yyparse();
  travessia();
  freeGrove();
  return(0);
}

Estrutura do Grove e Stack


#ifndef __GROVE__
#define __GROVE__

#include<stdlib.h>

typedef struct strAttr{
  char* name;
  char* value;
  struct strAttr* next;
}attrib;

typedef struct strGrove{
  char type;
  union X{
    char* text;
    struct element{
      char* id;
      struct strGrove* content;
      attrib* attr;
    }elem;
  }cont;
  struct strGrove* brother;
}Gnode;

typedef struct stack{
  Gnode* tag;
  struct stack* next;
}stk;

Gnode *top(void);

Gnode *pop(void);

int push(Gnode *ptr);

int isEmpty(void);

void openTag(char *id, attrib *attr);

void closeTag(void);

void travessia(void);

void aux_travessia(Gnode *node);

void freeGrove(void);

void aux_freeGrove(Gnode *node);

#endif

Implementação do Grove e Stack


#include"grove.h"
#include<stdlib.h>

stk *allptr=NULL;
Gnode *grove=NULL;

int push(Gnode *ptr){
  stk* tmp;

  if(isEmpty()){
    if(allptr = (stk*)malloc(sizeof(stk))){
      allptr->tag=ptr;
      allptr->next=NULL;
    }
    else {printf("PUSH: Malloc Error\n"); return(-1);}
  }
  else{
    if(tmp = (stk*)malloc(sizeof(stk))){
      tmp->tag=ptr;
      tmp->next=allptr;
      allptr=tmp;	
    }
    else {printf("PUSH: Malloc Error\n"); return(-1);}
  }
  return (0);
}

Gnode *pop(){
  Gnode* tmp;
  stk *tmp1;

  if(isEmpty()) {printf("POP: Stack is Empty\n"); exit(-1);}
  else{
    tmp1=allptr;
    tmp=allptr->tag;
    allptr=allptr->next;
    free(tmp1);
    return(tmp);	
  }
}

Gnode *top(){
  return(allptr->tag);
}

int isEmpty(){
  if(allptr) return(0);
  else return(-1);
}

void closeTag(){
  pop();
}

void insertText(char *str){
  Gnode *newText;
  Gnode *tmp;
 
  if(!str) return;
  if(!top()->cont.elem.content){
    if((top()->cont.elem.content=(Gnode *)malloc(sizeof(Gnode)))==NULL){
      printf("insertText: Malloc Error\n");
      exit(-1);
    }
    newText=top()->cont.elem.content;
  }
  else{
    tmp=top()->cont.elem.content;
    while(tmp->brother) tmp=tmp->brother; 
    if((tmp->brother=(Gnode *)malloc(sizeof(Gnode)))==NULL){
      printf("insertText: Malloc Error\n");
      exit(-1);
    }
    newText=tmp->brother;
  }
  newText->type='X';
  newText->cont.text=str;
  newText->brother=NULL;
}

void openTag(char *id, attrib *attr){
  Gnode *newTag;
  Gnode *tmp;

  if(!grove){
    if((grove=(Gnode *)malloc(sizeof(Gnode)))==NULL){
      printf("openTag: Malloc Error\n");
      exit(-1);
    }
    else newTag=grove;
  }
  else{
    if(!top()->cont.elem.content){
      if((top()->cont.elem.content=(Gnode *)malloc(sizeof(Gnode)))==NULL){
	 printf("openTag: Malloc Error\n");
	 exit(-1);
      }
      else newTag=top()->cont.elem.content;
    }
    else{
      tmp=top()->cont.elem.content;
      while(tmp->brother) tmp=tmp->brother; 
      if((tmp->brother=(Gnode *)malloc(sizeof(Gnode)))==NULL){
	 printf("openTag: Malloc Error\n");
	 exit(-1);
      }
      newTag=tmp->brother;
    }
  }
  newTag->type='G';
  newTag->cont.elem.id=id;
  newTag->cont.elem.content=NULL;
  newTag->cont.elem.attr=attr;
  newTag->brother=NULL;
  push(newTag);
}

void travessia(){
  if(grove){
    printf("( %s\n",grove->cont.elem.id);
    if(grove->cont.elem.content) aux_travessia(grove->cont.elem.content);
    printf(") %s\n",grove->cont.elem.id);
  }
}

void aux_travessia(Gnode *node){
  attrib *tmp;

  if(node->type=='G'){
    tmp=node->cont.elem.attr;
    while(tmp){
      printf("A %s=%s\n",tmp->name,tmp->value);
      tmp=tmp->next;
    }
    printf("( %s\n",node->cont.elem.id);
    if(node->cont.elem.content) aux_travessia(node->cont.elem.content);
    printf(") %s\n",node->cont.elem.id);
  }
  else printf("-%s\n",node->cont.text);
  if(node->brother) aux_travessia(node->brother);
}

void freeGrove(){
  if(grove){
    if(grove->cont.elem.content) aux_freeGrove(grove->cont.elem.content);
    free(grove->cont.elem.id);
    free(grove->cont.elem.content);
    free(grove);
  }
}

void aux_freeGrove(Gnode *node){
  attrib *tmp;
  attrib *aux;
  
  if(node->type=='G'){
    if(node->cont.elem.content) aux_freeGrove(node->cont.elem.content);
    tmp=node->cont.elem.attr;
    while(tmp){
      free(tmp->name);
      free(tmp->value);
      aux=tmp;
      tmp=aux->next;
      free(aux);
    }
    free(node->cont.elem.id);
    free(node->cont.elem.content);
  }
  else free(node->cont.text);

  if(node->brother) aux_freeGrove(node->brother);
  free(node);
}

Agradecimentos:

Bibliografia:

[1 ] Kernighan, Brian W. / Ritchie, Dennis M.. The C Programming Language, Second Edition. Prentice-Hall, 1988.
[2 ] Levine, John R. / Mason, Tony / Brown, Doug. Lex & Yacc, Second Edition. O'Reilley & Associates, Inc., 1992.